Abstract

The problem of discrete universal filtering, in which the components of a discrete signal emitted by an
unknown source and corrupted by a known DMC are to be causally estimated, is considered. A family of filters is
derived and shown to be universally asymptotically optimal in the sense of achieving the optimum filtering
performance when the clean signal is stationary, ergodic, and satisfies an additional mild positivity condition. Our
schemes approximate the noisy signal by a hidden Markov process (HMP) via maximum-likelihood (ML)
estimation, and then use the forward recursions for HMP state estimation. It is shown
that as the data length increases, and as the number of states in the HMP approximation increases, our family of
filters attains the performance of the optimal distribution-dependent filter.

Index Terms - Universal filtering, finite alphabet, hidden Markov process (HMP), stochastic setting,
randomized scheme, forward-backward recursion state estimation, ML parameter estimation

1 Introduction 

The problem of estimating a discrete-time, finite-alphabet source signal $\{X_t\}_{t \in T}$ from the entire observation of a
noisy signal $\{Z_t\}_{t \in T}$, which has been corrupted by a known discrete memoryless channel (DMC), has been thoroughly
studied recently in [21]. It has been shown that even though the source distribution is unknown, an algorithm called
DUDE can universally achieve the asymptotically optimal performance. This result has been extended in various
directions, such as the case of channel uncertainty [Sj, the case where the channel has memory, the case of
non-discrete noisy signal components [B], and the case where the reconstruction is required to depend causally on
the noisy signal. In this paper, we revisit the last case, taking a different approach from [18], [19].

The case where we estimate $X_t$ causally, based on the observation of the noisy signal $Z^t = (Z_1, \ldots, Z_t)$, is referred to
as filtering. The filter can be either deterministic or randomized (a concept that will be explained in detail later). In
this paper, we focus only on the stochastic setting, where we assume $\{X_t\}$ is a stationary and ergodic stochastic
process. In the stochastic setting, and under the same performance criterion as in [21], i.e., minimizing
the expected normalized cumulative loss, knowledge of the conditional distribution of $X_t$ given $Z^t$ at each time $t$ is
required to achieve the optimal performance. Also, by the same argument as in Section III], this conditional
distribution can be obtained from the conditional distribution of $Z_t$ given $Z^{t-1}$ when the invertible DMC is known.
(We call a channel invertible if its transition probability matrix is invertible.)

However, in the universal filtering setting, where the probability distribution of the source is unknown, the
conditional distribution of $Z_t$ given $Z^{t-1}$ is also unknown and needs to be learned from the observed noisy signal.
Therefore, if we can learn this conditional distribution accurately as the observation length increases, we can hope
to build a universal filtering scheme that achieves the asymptotically optimal performance from the estimated
conditional distribution. To pursue this goal, [18], [19] adopt the universal prediction approach. That is, they first
obtain an estimate of the conditional distribution of $Z_t$ given $Z^{t-1}$ by employing a universal predictor for the observed
noisy signal, and then, by inverting the known DMC, obtain an estimate of the conditional distribution of $X_t$ given
$Z^t$.

Unlike the approach of [18], [19], in this work we turn our attention to the rich theory of hidden Markov process
(HMP) models to directly obtain a different kind of estimate of the conditional distribution of $X_t$ given $Z^t$, without
going through the channel inversion stage.

Generally, HMPs are defined as a family of stochastic processes that are outputs of a memoryless channel whose
inputs are finite-state Markov chains. As can be seen in [7], these HMP models arise in many areas, such as information
theory, communications, statistics, learning, and speech recognition. Among these applications of HMPs, there are
many situations where the state of the underlying Markov chain needs to be estimated based on the observed hidden
Markov process. If the exact parameters of the HMP, namely, the state transition probability of the Markov chain
and the channel transition density, as well as the order of the Markov chain are known, then this problem can be
easily solved via well-known forward-backward recursions which were discovered by 0] and • Especially, when we 
are estimating the state based on the causal observation of the HMP, we only need the forward recursion formula. 
In addition, much work has been done for the state estimation, where the order is known, but the parameters of 
the HMP are unknown. In this case, the parameters are first estimated via maximum likelihood (ML) estimation 
or the EM algorithm, then the state is estimated by using the estimated parameters in the recursion formula. A 
detailed explanation of this approach and the property of the ML parameter estimation can be found in |2] El • 
Furthermore, this was extended to the case where the order of the Markov chain is also not known, but the upper 
bound on the order is known. In this case, the order estimation is first performed before the parameter and state 
estimation, and the above process is repeated. The references for the order estimation are given in There 
also has been work for the case where even the knowledge of the upper bound on the order of the Markov chain is 
not required |Sj 123- 

From these rich theories for the state, parameter, and order estimation of HMPs, we can see that it is possible to
build a universal filtering scheme if the source distribution is known to be a finite-state Markov chain. That is, since
the channel is memoryless and fixed in our setting, if our source $\{X_t\}$ is a finite-state Markov chain, then
$\{Z_t\}$ is an HMP, and we can first estimate the order of the Markov chain, then estimate the parameter, and finally
perform the forward recursion to learn the conditional distribution of $X_t$ given $Z^t$. From the consistency results of order
estimation and parameter estimation, this conditional distribution will be an accurate estimate of the true one, and
we can use it to build the universal filtering scheme.

Now, in our work, we extend this approach to the case where our source $\{X_t\}$ is a general stationary and ergodic
process (with some benign conditions), which need not be a Markov source at all, and show that we can still build
a universal filtering scheme that achieves asymptotically optimal performance. The skeleton of our scheme is the
following: We first "model" our source as a finite-state Markov chain with a certain order, or equivalently, model
the noisy observed signal $\{Z_t\}$ as an HMP in a certain class. Then, we estimate the parameters of the HMP that
"approximates" the noisy signal best in that class. We will show that, from the consistency result about the ML
parameter estimation for the mismatched model [H|, these estimated parameters will give an accurate estimate of
the conditional distribution of $X_t$ given $Z^t$, as the observation length increases and the HMP class gets richer. Then,
this result will guarantee that our universal filter using this conditional distribution attains the asymptotically
optimal performance. In practice, this approach has been heuristically employed in many applications for nonlinear
filtering without theoretical justification. Therefore, this work provides the first theoretical justification
of the HMP-based universal filtering scheme.

The remainder of the paper is organized as follows. Section 2 introduces some notation and preliminaries that
are needed for setting up the problem. In Section 3, the universal filtering problem is defined explicitly. In Section
4, our universal filtering scheme is devised, and the main theorem is stated and proved. Section 5 extends our approach
to the case where the channel has memory. Section 6 concludes the paper and lists some related future directions.
Detailed technical proofs that are needed in the course of proving our main results are given in the Appendix.

2 Notation and preliminaries 

2-A General notation 

We assume that the clean, noisy, and reconstruction signal components take their values in the same finite $M$-ary
alphabet $\mathcal{A} = \{0, \ldots, M-1\}$. The simplex of $M$-dimensional column probability vectors will be denoted as $\mathcal{M}$.

The DMC is known to the filter and is denoted by its transition probability matrix $\Pi = \{\Pi(i,j)\}_{i,j \in \mathcal{A}}$. Here,
$\Pi(i,j)$ denotes the probability of channel output symbol $j$ when the input is $i$. We assume $\Pi(i,j) > 0$ for all $i, j$, and let
$\Pi_{\min} = \min_{i,j} \Pi(i,j)$. We assume this channel matrix is invertible and denote the inverse as $\Pi^{-1}$. Let $\Pi^{-1}_i$ denote
the $i$-th column of $\Pi^{-1}$. We also assume a given loss function (fidelity criterion) $\Lambda : \mathcal{A}^2 \to [0, \infty)$, represented by
the loss matrix $\Lambda = \{\Lambda(i,j)\}_{i,j \in \mathcal{A}}$, where $\Lambda(i,j)$ denotes the loss incurred when estimating the symbol $i$ with the
symbol $j$. The maximum single-letter loss will be denoted by $\Lambda_{\max} = \max_{i,j \in \mathcal{A}} \Lambda(i,j)$, and $\lambda_j$ will denote the $j$-th
column of $\Lambda$.

As in , we define the extended Bayes response associated with the loss matrix $\Lambda$ to any column vector $\mathbf{V} \in \mathbb{R}^M$
as

$$B(\mathbf{V}) = \arg\min_{\hat{x} \in \mathcal{A}} \lambda_{\hat{x}}^T \mathbf{V},$$

where $\arg\min_{\hat{x} \in \mathcal{A}}$ denotes the minimizing argument, resolving ties by taking the letter in the alphabet with the
lowest index.
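
As a small illustration of the extended Bayes response, the following sketch computes the index minimizing the inner product of a loss-matrix column with a given vector, breaking ties toward the lowest index. The function name `bayes_response` and the list-of-lists representation of the loss matrix are assumptions made for this example only.

```python
def bayes_response(loss, v):
    """Return the reconstruction symbol xhat minimizing sum_x loss[x][xhat] * v[x].

    loss[x][xhat] is the loss of estimating symbol x by xhat (hypothetical
    list-of-lists layout); v is any real vector of the same length.
    Ties are resolved in favor of the lowest index.
    """
    m = len(loss)
    best_idx, best_val = 0, None
    for xhat in range(m):
        # Inner product of the xhat-th column of the loss matrix with v.
        val = sum(loss[x][xhat] * v[x] for x in range(m))
        # Strict inequality keeps the earliest index on ties.
        if best_val is None or val < best_val:
            best_idx, best_val = xhat, val
    return best_idx


hamming = [[0, 1], [1, 0]]
print(bayes_response(hamming, [0.3, 0.7]))
print(bayes_response(hamming, [0.5, 0.5]))
```

Under Hamming loss this reduces to picking the most probable symbol when the vector is a probability vector, and the tie case returns symbol 0.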

We let $P$ denote the true joint probability law of the clean and noisy signal, and $E(\cdot)$ denote expectation with
respect to $P$. Also, every almost sure convergence is with respect to $P$. If we need to refer to the probability law
of the clean or noisy signal induced by $P$, we denote it by $P_X$ and $P_Z$, respectively. If $P$ is written in bold face, $\mathbf{P}$, with
a subscript, it stands for a simplex vector in $\mathcal{M}$ for the corresponding distribution of the subscript. For example,
$\mathbf{P}_{X_t|z^t}$ is a column $M$-vector whose $i$-th component is $P(X_t = i | Z^t = z^t)$.

When we have some other probability law denoted as $Q$, and want to measure its difference from $P$, a natural
choice of such a measure is the relative entropy rate. First, denote the $n$-th order relative entropy between $P$ and $Q$
as

$$D_n(P\|Q) = \sum_{z^n} P(z^n) \log \frac{P(z^n)}{Q(z^n)} = E\left(\log \frac{P(Z^n)}{Q(Z^n)}\right).$$

Then, the relative entropy rate (also known as the Kullback-Leibler divergence rate) is defined as

$$\mathbf{D}(P\|Q) \triangleq \lim_{n \to \infty} \frac{1}{n} D_n(P\|Q)$$

if the limit exists. When $Q$ is a probability law in a certain class of HMPs, this limit always exists and the relative
entropy rate is well defined. A more detailed discussion about this limit will be given in Lemma 2. This relative
entropy rate will play a central role in analyzing our universal filtering scheme.
2-B Hidden Markov Processes (HMPs)
2-B.1 Definition

As stated in the Introduction, HMPs are generally defined as a family of stochastic processes that are outputs
of a memoryless channel whose inputs are finite-state Markov chains. Throughout the paper, we will only consider
the case in which the alphabet of the HMP, $\mathcal{Z}$, and that of the underlying Markov chain, $\mathcal{X}$, are finite and equal, i.e., $\mathcal{Z} = \mathcal{X} = \mathcal{A}$,
and the channel is a DMC and invertible.

There are three parameters that determine the probability law of an HMP: $\pi$, the initial distribution of the finite-state
Markov chain; $A$, the probability transition matrix of the finite-state Markov chain; and $B$, the probability transition
matrix of the DMC. The triplet $\{\pi, A, B\}$ is referred to as the parameter of the HMP. Let $\Theta$ be the set of all $\theta$'s, where
$\theta := \{\pi_\theta, A_\theta, B_\theta\}$. For each $\theta$, we can calculate the likelihood function

$$Q_\theta(z^n) = \pi_\theta^T \prod_{t=1}^{n} (B_{\theta,t} A_\theta) \mathbf{1},$$

where $B_{\theta,t}$ is an $M \times M$ diagonal matrix whose $(j,j)$-th entry is the $(j, z_t)$-th entry of $B_\theta$, and $\mathbf{1}$ is the $M \times 1$ vector
with all entries equal to 1.
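
The matrix-product form of the likelihood can be checked numerically on a toy example. The sketch below assumes the likelihood is the initial row vector multiplied, for each observation, by the diagonal emission matrix and then by the transition matrix, and finally summed; it compares that product against a brute-force sum over all state sequences. The function names and the numerical parameter values are hypothetical.

```python
from itertools import product


def hmp_likelihood(pi, A, B, z):
    """Likelihood of z under an HMP via pi^T * prod_t (B_t A) * 1.

    pi: initial distribution, A: row-stochastic transition matrix,
    B[x][z]: channel probability of output z given state x.
    """
    m = len(pi)
    row = list(pi)  # running row vector, starts as pi^T
    for zt in z:
        # Right-multiply by the diagonal matrix B_t = diag(B[., z_t]).
        row = [row[j] * B[j][zt] for j in range(m)]
        # Right-multiply by the transition matrix A.
        row = [sum(row[i] * A[i][j] for i in range(m)) for j in range(m)]
    return sum(row)  # right-multiply by the all-ones vector


def brute_force_likelihood(pi, A, B, z):
    """Same likelihood, summing the joint law over every state sequence."""
    m, n = len(pi), len(z)
    total = 0.0
    for x in product(range(m), repeat=n):
        p = pi[x[0]] * B[x[0]][z[0]]
        for t in range(1, n):
            p *= A[x[t - 1]][x[t]] * B[x[t]][z[t]]
        total += p
    return total


pi = [0.5, 0.5]
A = [[0.9, 0.1], [0.2, 0.8]]
B = [[0.8, 0.2], [0.2, 0.8]]
print(hmp_likelihood(pi, A, B, [0, 1, 1]), brute_force_likelihood(pi, A, B, [0, 1, 1]))
```

The trailing multiplication by the transition matrix does not change the value, because each row of a stochastic matrix sums to one.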

Now, let $\Theta_k \subset \Theta$ be the set of $\theta$'s such that the order of the underlying Markov chain of the HMP is $k$. Furthermore, for
some $\delta > 0$, define $\Theta_k^\delta \subset \Theta_k$ as the set of $\theta \in \Theta_k$ satisfying:

• $a_{ij,\theta} \geq \delta$, if the first $k-1$ components of the $k$-tuple state $j$ are equal to the last $k-1$ components of the $k$-tuple
state $i$

• $a_{ij,\theta} = 0$, otherwise

• $b_{ij,\theta} = \Pi(i,j)$, for all $i, j$,

where $a_{ij,\theta}$ is the $(i,j)$-th entry of $A_\theta$, and $b_{ij,\theta}$ is the $(i,j)$-th entry of $B_\theta$. In particular, if $\theta \in \Theta_k^\delta$, then: 1) the stochastic
matrix $A_\theta$ is irreducible and aperiodic; thus, if the Markov chain is stationary, $\pi_\theta$ is the stationary distribution of
the Markov chain, and is uniquely determined from $A_\theta$; 2) $B_\theta = \Pi$ for all $\theta$, and, therefore, $\theta$ is completely specified by
$A_\theta$. For notational brevity, we omit the subscript $\theta$ and write $Q \in \Theta_k^\delta$ if $Q = Q_\theta$ and $\theta \in \Theta_k^\delta$.
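
The structural constraint on the transition matrix (a transition between two k-tuple states is permitted only when the tuples overlap in k-1 components) can be sketched as follows. The helper name `allowed` and the tuple representation of states are assumptions for illustration.

```python
from itertools import product


def allowed(i, j):
    """True if the first k-1 components of k-tuple j equal the last k-1 of k-tuple i."""
    return j[:-1] == i[1:]


# Example with a binary alphabet (M = 2) and order k = 3.
M, k = 2, 3
states = list(product(range(M), repeat=k))
# Each state can move to exactly M successor states, one per new symbol.
successors = {i: [j for j in states if allowed(i, j)] for i in states}
print(successors[(0, 1, 1)])
```

Only these M entries per row of the transition matrix may be nonzero, so in this sketch a k-th order model over an M-ary alphabet has M^k rows with M free entries each.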

2-B.2 Maximum likelihood (ML) estimation

Generally, suppose a probability law $Q$ is in a certain class $\Omega$. Then, the $n$-th order maximum likelihood (ML)
estimator in $\Omega$ for the observed sequence $z^n$ is defined as

$$Q[z^n] = \arg\max_{Q \in \Omega} Q(z^n),$$

resolving ties arbitrarily. Now, if $Q \in \Theta_k^\delta$, then there is an algorithm called expectation-maximization (EM) [4] that
iteratively updates the parameter estimates to maximize the likelihood. Thus, when $Q$ is in the class of probability
laws of an HMP, the maximum likelihood estimate can be efficiently attained.$^1$ We denote the ML estimator in $\Theta_k^\delta$
based on $z^n$ by

$$Q_{k,\delta}[z^n] = \arg\max_{Q \in \Theta_k^\delta} Q(z^n).$$

Obviously, when the $n$-tuple $Z^n$ is random, $Q_{k,\delta}[Z^n]$ is also a random probability law that is a function of $Z^n$.

$^1$ We neglect issues of convergence of the EM algorithm and assume that the ML estimation is performed perfectly.
2-B.3 Consistency of ML estimator

When $P_Z \in \Theta_k^\delta$, an ML estimator $Q_{k,\delta}[z^n]$ is said to be strongly consistent if

$$\lim_{n \to \infty} Q_{k,\delta}[Z^n] = P_Z \quad \text{a.s.}$$

The strong consistency of the ML estimator $Q_{k,\delta}[Z^n]$ of the parameter of a finite-alphabet stationary ergodic HMP
was proved in pQ. For the case of a general stationary ergodic HMP, the strong consistency was proved in [12].

We also have a notion of strong consistency for the case where $P_Z$ is a general stationary and ergodic process. By
an argument similar to that in 8, Theorem 2.2.1], if the observed noisy signal
is not necessarily an HMP and we still perform the ML estimation in $\Theta_k^\delta$, then we get

$$\lim_{n \to \infty} Q_{k,\delta}[Z^n] \in \mathcal{N} \quad \text{a.s.,} \tag{1}$$

where $\mathcal{N} = \{Q \in \Theta_k^\delta : \mathbf{D}(P_Z\|Q) = \min_{Q' \in \Theta_k^\delta} \mathbf{D}(P_Z\|Q')\}$.$^2$ This second consistency result is the key result that we
will use in devising and analyzing our universal filtering scheme.

3 The universal filtering problem 

As mentioned in the Introduction, we will assume a stochastic setting, that is, the underlying clean signal is an
output of some stationary and ergodic process whose probability law is $P_X$. From $P_X$ and $\Pi$, we can get the true
joint probability law $P$ and the corresponding probability law of the noisy observed signal, $P_Z$. That is,

$$P(X^n = x^n, Z^n = z^n) = P_X(X^n = x^n) \prod_{t=1}^{n} \Pi(x_t, z_t), \quad \text{and}$$

$$P_Z(Z^n = z^n) = \sum_{x^n} P(X^n = x^n, Z^n = z^n).$$

A filter is a sequence of probability distributions $\hat{\mathbf{X}} = \{\hat{X}_t\}$, where $\hat{X}_t : \mathcal{A}^t \to \mathcal{M}$. The interpretation is that, upon
observing $z^t$, the reconstruction for the underlying, unobserved $X_t$ is represented by the symbol $\hat{x}$ with probability
$\hat{X}_t(z^t)[\hat{x}]$. A filter is called deterministic if $\hat{X}_t(z^t)$ is a unit vector in $\mathbb{R}^M$ for all $t$ and $z^t$, and randomized if $\hat{X}_t(z^t)$
can be a simplex vector in $\mathcal{M}$ other than a unit vector for some $t$ and $z^t$. The normalized cumulative loss of the
scheme $\hat{\mathbf{X}}$ on the individual pair $(x^n, z^n)$ is defined by

$$L_{\hat{\mathbf{X}}}(x^n, z^n) = \frac{1}{n} \sum_{t=1}^{n} \ell(x_t, \hat{X}_t(z^t)),$$

where $\ell(x_t, \hat{X}_t(z^t)) = \sum_{\hat{x} \in \mathcal{A}} \Lambda(x_t, \hat{x}) \hat{X}_t(z^t)[\hat{x}]$. Then, the goal of a filter is to minimize the expected normalized
cumulative loss $E\left(L_{\hat{\mathbf{X}}}(X^n, Z^n)\right)$.
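
To make the performance measure concrete, the following sketch evaluates a normalized cumulative loss for a short sequence, reading the per-symbol loss as the expectation of the loss matrix entry under the filter's output distribution and the cumulative loss as the time average of those values. This reading, the function names, and the toy data are assumptions for illustration.

```python
def expected_symbol_loss(loss, x, dist):
    """Expected loss at one time step: sum over xhat of loss[x][xhat] * dist[xhat]."""
    return sum(loss[x][xhat] * dist[xhat] for xhat in range(len(dist)))


def normalized_cumulative_loss(loss, xs, filter_outputs):
    """Average of the per-symbol expected losses over the whole sequence.

    xs: clean symbols x_1..x_n; filter_outputs: one probability vector per time step.
    """
    n = len(xs)
    return sum(expected_symbol_loss(loss, x, d) for x, d in zip(xs, filter_outputs)) / n


hamming = [[0, 1], [1, 0]]
xs = [0, 1, 1]
# A deterministic output (unit vector), a randomized one, and another deterministic one.
outputs = [[1.0, 0.0], [0.5, 0.5], [1.0, 0.0]]
print(normalized_cumulative_loss(hamming, xs, outputs))
```

A deterministic filter corresponds to every output being a unit vector, in which case each per-symbol term is simply one entry of the loss matrix.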

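The optimal performance of the $n$-th order filter is defined as

$$\phi_n(P_X, \Pi) = \min_{\hat{\mathbf{X}} \in \mathcal{F}} E\left(L_{\hat{\mathbf{X}}}(X^n, Z^n)\right),$$

where $\mathcal{F}$ denotes the class of all filters. Sub-additivity arguments similar to those in [21] imply

$$\lim_{n \to \infty} \phi_n(P_X, \Pi) = \inf_{n \geq 1} \phi_n(P_X, \Pi) \triangleq \Phi(P_X, \Pi).$$

$^2$ Just as in Theorem 2.2.1], the notion of a.s. set convergence is used. For any subset $\mathcal{E} \subseteq \Theta$, define $\mathcal{E}_\epsilon = \{Q \in \Theta : d(Q, \mathcal{E}) < \epsilon\}$,
where $d$ is the Euclidean distance. Then, $\lim_{n \to \infty} Q[Z^n] \in \mathcal{E}$ a.s. if $\forall \epsilon > 0$, $\exists N(\epsilon, \omega)$ such that $\forall n > N(\epsilon, \omega)$, $Q[Z^n] \in \mathcal{E}_\epsilon$.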
By definition, $\Phi(P_X, \Pi)$ is the (distribution-dependent) optimal asymptotic filtering performance attainable when
the clean signal is generated by the law $P_X$ and corrupted by $\Pi$. This $\Phi(P_X, \Pi)$ can be achieved by the optimal
filter $\hat{\mathbf{X}}_P = \{\hat{X}_{P,t}\}$, where

$$\hat{X}_{P,t}(z^t)[\hat{x}] = \Pr\left(B(\mathbf{P}_{X_t|z^t}) = \hat{x}\right).$$

For brevity of notation, we denote $\hat{X}_P(z^t) = \hat{X}_{P,t}(z^t)$. Note that this is a deterministic filter, i.e., for a given $z^t$, the
filter output is a unit vector in $\mathbb{R}^M$ for all $t$. We can easily see that this filter is optimal since it minimizes $E(\ell(X_t, \hat{X}_t(Z^t)))$
for all $t$, and thus, it minimizes $E\left(L_{\hat{\mathbf{X}}}(X^n, Z^n)\right)$ for all $n$.

As can be seen, $\hat{X}_P(z^t)$ needs the exact knowledge of $\mathbf{P}_{X_t|z^t}$, and thus depends on the distribution of the
underlying clean signal. The universal filtering problem is to construct (possibly a sequence of) filter(s), $\hat{\mathbf{X}}_{univ}$, that
is independent of the distribution of the underlying clean signal, $P_X$, and yet asymptotically achieves $\Phi(P_X, \Pi)$. We
describe our sequence of universal filters in the next section.

4 Universal filtering based on hidden Markov modeling 
4-A Description of the filter 

Before describing our sequence of universal filters, we make the following assumption on the source. 

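Assumption 1 There exists a sequence of positive reals $\{\delta_k\}$, such that $\delta_k \downarrow 0$ as $k \to \infty$, and $P_X$ satisfies

$$P_X(X_0 | X_{-k}^{-1}) \geq \delta_k \quad \text{a.s.} \quad \forall k \in \mathbb{N}. \tag{2}$$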

For any probability law $Q$, we construct a randomized filter as follows: For $\epsilon > 0$, denote the $L_2$ $\epsilon$-ball in $\mathbb{R}^M$ as
$B_\epsilon = \{\mathbf{V} \in \mathbb{R}^M : \|\mathbf{V}\|_2 \leq \epsilon\}$. Then, we define a filter for fixed $\epsilon$ as

$$\hat{X}^\epsilon_{Q,t}(z^t)[\hat{x}] = \Pr\left(B(\mathbf{Q}_{X_t|z^t} + \mathbf{U}) = \hat{x}\right), \tag{3}$$

where $\mathbf{U} \in \mathbb{R}^M$ is a random vector, uniformly distributed in $B_\epsilon$. For brevity of notation, we denote $\hat{X}^\epsilon_Q(z^t) = \hat{X}^\epsilon_{Q,t}(z^t)$.
This filter is randomized since, depending on $Q$ and $z^t$, $\hat{X}^\epsilon_Q(z^t)$ can be a probability simplex vector in $\mathcal{M}$ that is not
a unit vector. The reason this randomization is needed will be explained in the proof of Lemma 3.
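
The effect of the randomization can be illustrated by a Monte Carlo sketch: perturb a posterior vector by a point drawn uniformly from the L2 ball of radius epsilon, take the Bayes response, and record how often each symbol is chosen. The sampling method (Gaussian direction with a radius scaled by a power of a uniform variate), the function names, and the sample sizes are assumptions; the filter in the text is defined by exact probabilities, which this sketch only approximates.

```python
import math
import random


def bayes_response(loss, v):
    """Index xhat minimizing sum_x loss[x][xhat] * v[x]; lowest index wins ties."""
    m = len(loss)
    vals = [sum(loss[x][xhat] * v[x] for x in range(m)) for xhat in range(m)]
    return vals.index(min(vals))


def sample_l2_ball(dim, eps, rng):
    """Draw a point uniformly from the L2 ball of radius eps in R^dim."""
    g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in g))
    radius = eps * rng.random() ** (1.0 / dim)  # radius law for a uniform ball
    return [radius * c / norm for c in g]


def randomized_filter(loss, posterior, eps, n_samples=20000, seed=0):
    """Monte Carlo estimate of Pr(B(posterior + U) = xhat) for each xhat."""
    rng = random.Random(seed)
    m = len(posterior)
    counts = [0] * m
    for _ in range(n_samples):
        u = sample_l2_ball(m, eps, rng)
        perturbed = [posterior[x] + u[x] for x in range(m)]
        counts[bayes_response(loss, perturbed)] += 1
    return [c / n_samples for c in counts]


hamming = [[0, 1], [1, 0]]
print(randomized_filter(hamming, [0.9, 0.1], 0.01))  # far from a tie
print(randomized_filter(hamming, [0.5, 0.5], 0.01))  # exactly at a tie
```

Away from a tie the perturbation never changes the decision, so the output is a unit vector; at a tie the output spreads its mass over the tied symbols instead of jumping discontinuously.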

To devise our filter, let us first consider an increasing sequence of positive integers, $\{m_i\}_{i \geq 1}$, that satisfies the following
conditions:

$$\lim_{i \to \infty} \frac{m_{i-1}}{m_i} = 0, \qquad \lim_{i \to \infty} m_i = \infty. \tag{4}$$

Now, define

$$i(t) = \max\{i : m_i < t\}.$$
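
The block index can be computed with a binary search over the block boundaries. The sketch below assumes the boundaries are stored in an increasing list, that the index is 1-based, that the comparison is strict, and that 0 is returned when no boundary lies below t; the factorial-like example schedule is also hypothetical.

```python
from bisect import bisect_left


def block_index(boundaries, t):
    """Largest 1-based index i with boundaries[i-1] < t, or 0 if there is none.

    boundaries must be strictly increasing (m_1 < m_2 < ...).
    """
    # bisect_left returns the number of entries strictly smaller than t.
    return bisect_left(boundaries, t)


def training_length(boundaries, t):
    """Number of noisy samples used to fit the model applied at time t."""
    i = block_index(boundaries, t)
    return boundaries[i - 1] if i > 0 else 0


m = [1, 2, 6, 24, 120]  # hypothetical rapidly growing schedule
for t in (1, 2, 6, 7, 121):
    print(t, block_index(m, t), training_length(m, t))
```

With this convention the model applied at time t is always fitted on strictly earlier samples, so the resulting filter remains causal.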

Then, given that our source distribution satisfies (2), and for fixed $k$, define a random probability law

$$Q_k^t \triangleq Q_{k,\delta_k}[Z^{m_{i(t)}}] = \arg\max_{Q \in \Theta_k^{\delta_k}} Q(Z^{m_{i(t)}}). \tag{5}$$

That is, $Q_k^t$ is the ML estimator in $\Theta_k^{\delta_k}$ based on $Z^{m_{i(t)}}$. As discussed in Section 2-B.1, we only need to estimate the
state transition probabilities of the underlying Markov chain to obtain this ML estimator, and this can be efficiently
done by the Expectation-Maximization (EM) algorithm. Once we get $Q_k^t$, we can then calculate $\mathbf{Q}^t_{k, X_t|z^t}$ using the
forward recursion formula, which is described in detail in [4]. Note that we get this conditional distribution directly,
not by first estimating the output distribution, and then inverting the channel, as was done in |18 | |19 | PT]. 
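
For a first-order chain, the forward recursion that produces the conditional distribution of the current state given the observations so far can be sketched as follows. This is a generic textbook form of the recursion under assumed list-of-lists parameters, not a transcription of the formula in the cited reference; the parameter values are hypothetical.

```python
def forward_filter(pi, A, B, z):
    """Return the filtering distributions Q(X_t = . | z_1..z_t) for t = 1..n.

    pi: distribution of X_1, A: row-stochastic transition matrix,
    B[x][z]: channel probability of output z given state x.
    """
    m = len(pi)
    posteriors = []
    predicted = list(pi)  # Q(X_t | z^{t-1}); equals pi at t = 1
    for zt in z:
        # Measurement update: weight the prediction by the channel likelihood.
        unnorm = [predicted[x] * B[x][zt] for x in range(m)]
        total = sum(unnorm)
        posterior = [u / total for u in unnorm]
        posteriors.append(posterior)
        # Time update: propagate the posterior through the Markov chain.
        predicted = [sum(posterior[i] * A[i][j] for i in range(m)) for j in range(m)]
    return posteriors


pi = [0.5, 0.5]
A = [[0.9, 0.1], [0.2, 0.8]]
B = [[0.8, 0.2], [0.2, 0.8]]
for post in forward_filter(pi, A, B, [0, 0, 1]):
    print(post)
```

Each step uses only the current observation and the previous posterior, which is why the state estimate can be produced causally without a backward pass.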
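Finally, we take as our sequence of universal filtering schemes, indexed by $k$ and $\epsilon$,

$$\hat{\mathbf{X}}^\epsilon_{univ,k} = \{\hat{X}^\epsilon_{Q_k^t, t}\},$$

that is, the randomized filter (3) with $Q = Q_k^t$ at each time $t$. The following theorem states the main result of this paper.

Theorem 1 Let $X^\infty \in \mathcal{A}^\infty$ be a stationary, ergodic process emitted by the source $P_X$, which satisfies Assumption 1.
Let $Z^\infty \in \mathcal{A}^\infty$ be the output of the DMC, $\Pi$, whose input is $X^\infty$. Then:

(a) $\lim_{\epsilon \downarrow 0} \lim_{k \to \infty} \limsup_{n \to \infty} L_{\hat{\mathbf{X}}^\epsilon_{univ,k}}(X^n, Z^n) \leq \Phi(P_X, \Pi)$ a.s.

(b) $\lim_{\epsilon \downarrow 0} \lim_{k \to \infty} \limsup_{n \to \infty} E\left(L_{\hat{\mathbf{X}}^\epsilon_{univ,k}}(X^n, Z^n)\right) = \Phi(P_X, \Pi)$

4-B Intuition behind the scheme and proof sketch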

The intuition behind our scheme parallels that of the universal compression and universal prediction problems in the
stochastic setting. In the $n$-th order problem of both cases [SJ^I], the excess expected codeword length per symbol
and the excess expected normalized cumulative loss incurred by using the wrong probability law $Q$ in place of the
true probability law $P$ can be upper bounded by the normalized $n$-th order relative entropy $\frac{1}{n} D_n(P\|Q)$. Then,
to achieve the asymptotically optimum performance, the compressor and the predictor try to find and use some
data-dependent $Q$ that makes $\frac{1}{n} D_n(P\|Q) \to 0$ as $n \to \infty$, that is, makes $\mathbf{D}(P\|Q)$ zero.

We follow the same intuition in our universal filtering problem. For fixed $k$ and $\epsilon$, our scheme, as can be seen
from (5), divides the noisy observed signal into sub-blocks of length $(m_i - m_{i-1})$. Since $m_{i-1}/m_i$ tends to zero as $i \to \infty$,
the length of each sub-block grows faster than exponentially. Now, to filter each sub-block, it plugs in the ML estimator
in $\Theta_k^{\delta_k}$ obtained from the entire observation of the noisy signal up to the previous sub-block. From (1), we know that
as the observation length $n$ increases, this ML estimator will converge to the parameter that minimizes the relative
entropy rate with respect to the true output probability law $P_Z$. Then, to show that this scheme achieves the asymptotically
optimum performance, we bound the excess expected normalized cumulative loss in terms of this relative entropy rate, and
show that the bound goes to zero as the HMP parameter set becomes richer, that is, as $k$ increases.

To be more specific, we briefly sketch the proof of our main theorem. Part (b) of Theorem 1 states that our scheme
is asymptotically optimal. We can easily see that this follows directly from Part (a) and the reverse Fatou lemma.
Therefore, proving Part (a) is the key to proving the theorem. Part (a) states that in the limit, the normalized
cumulative loss of our scheme, for almost every realization, is less than or equal to the asymptotically optimum
performance.

To prove Part (a), we first fix $k$ and $\epsilon$, and get the following inequality

$$\limsup_{n \to \infty} \left(L_{\hat{\mathbf{X}}^\epsilon_{univ,k}}(X^n, Z^n) - \phi_n(P_X, \Pi)\right) \leq F\left(\limsup_{t \to \infty} \mathbf{D}(P_Z\|Q_k^t), \epsilon\right) \quad \text{a.s.,} \tag{6}$$

where $F(x, y)$ is some function such that $F(x, y) \to 0$ as $x \downarrow 0$, and then $y \downarrow 0$.$^3$ There are two keys to getting this
inequality. The first one is to show the concentration of $L_{\hat{\mathbf{X}}^\epsilon_{univ,k}}(X^n, Z^n)$ around its expectation, which will be shown in
Lemma 3 and the accompanying corollary. The second is to get the explicit upper bound function $F(x, y)$, which will be based on
Lemma 4. Once this inequality is established, we show that

$$\lim_{k \to \infty} \limsup_{t \to \infty} \mathbf{D}(P_Z\|Q_k^t) = 0 \quad \text{a.s.,} \tag{7}$$

from Lemma 5, and then send $\epsilon \downarrow 0$ to get Part (a). Keeping this proof sketch in mind, let us move on to the detailed
proof in the next section.

$^3$ Note that $Q_k^t$ in $\mathbf{D}(P_Z\|Q_k^t)$ is a function of $Z^{m_{i(t)}}$, and thus, is random. A more formal definition of the relative entropy rate between
the true and a random probability law, as in this case, will be given after Lemma PI

4-C Proof of the theorem 

Before proving the theorem, we introduce several lemmas as building blocks. Lemma 1 and Lemma 2 below give some
general results for the HMPs that we are considering. Our lemmas are similar to |H1 Lemma 2.3.4] and Theorem
2.3.3]. The latter assumed that all the parameters are lower bounded by $\delta > 0$, whereas in $\Theta_k^\delta$, some parameters can
be zero. We take this into account in proving Lemma 1 and Lemma 2. Lemma 3 shows the uniform concentration
property of the normalized cumulative loss on $\Theta_k^\delta$, which is an important property that we need to prove the main
theorem. Lemma 4 provides a key step in obtaining the upper bound described in (6), and Lemma 5, which needs three
additional definitions, enables us to show (7). After building up the lemmas, we give the proof of the main theorem,
which is merely an application of the lemmas.

Lemma 1 Suppose $Q \in \Theta_k^\delta$ and fix $\delta > 0$. Then, $\forall \omega$, $Q(Z_0|Z_{-t}^{-1})$ converges to a limit $Q(Z_0|Z_{-\infty}^{-1})$ uniformly on $\Theta_k^\delta$.

Proof: To prove this lemma, we need three more lemmas in Appendix 1, which are variations on those found in
[J. Let us denote $f_t := Q(Z_0|Z_{-t}^{-1})$, and $f_0 = 0$. Then, the sequence $\{f_t\}$ converges uniformly on $\Theta_k^\delta$ if the following $k$
subsequences,

$$\{f_{jk+l},\ j = 0, 1, 2, \ldots\}, \quad 0 \leq l \leq k-1,$$

converge uniformly on $\Theta_k^\delta$ and have the same limit.

First, the uniform convergence of each subsequence $\{f_{jk+l}\}$ can be shown by showing that the series $\sum_{j=0}^{\infty} (f_{(j+1)k+l} - f_{jk+l})$
converges absolutely. From Lemma 8 in Appendix 1, setting $m = k$,

$$\sum_{j=1}^{t} \left|f_{(j+1)k+l} - f_{jk+l}\right| \leq M \sum_{j=1}^{t} (\rho_{\delta,k,k})^{j+1}.$$

Since $\rho_{\delta,k,k} < 1$, $M < \infty$, and $\rho_{\delta,k,k}$ does not depend on $Q$, $\omega$, and $l$, we conclude that all $k$ subsequences converge
uniformly on $\Theta_k^\delta$.

Now, to show that the $k$ subsequences have the same limit, construct another subsequence, $\{f_{j(k+1)+l},\ j = 0, 1, 2, \ldots\}$.
Since this subsequence contains infinitely many terms from all $k$ subsequences, if this subsequence
converges uniformly on $\Theta_k^\delta$, we can conclude that the $k$ subsequences have the same limit. The derivation of the
uniform convergence of this subsequence is the same as that described above, but setting $m = k + 1$ in Lemma 8.
Therefore, the original sequence $\{f_t\}$ converges to its limit uniformly on $\Theta_k^\delta$. ■

The remarkable fact about this lemma is that the convergence is uniform not only on $\Theta_k^\delta$, but also in $\omega$. That is, the
convergence holds uniformly over every realization of $z_{-\infty}^{0}$.

Lemma 2 For the distribution of the observed noisy process $\{Z_t\}$, $P_Z$, and every $Q \in \Theta_k^\delta$,

$$\mathbf{D}(P_Z\|Q) \triangleq \lim_{n \to \infty} \frac{1}{n} D_n(P_Z\|Q) = E\left(\log \frac{P_Z(Z_0|Z_{-\infty}^{-1})}{Q(Z_0|Z_{-\infty}^{-1})}\right).$$

Moreover,

$$\lim_{n \to \infty} \frac{1}{n} \log \frac{P_Z(Z^n)}{Q(Z^n)} = \mathbf{D}(P_Z\|Q) \quad \text{a.s. uniformly on } \Theta_k^\delta.$$



Proof: This lemma consists of three parts. The first part is to show the existence of the first limit in the
lemma, so that the definition of $\mathbf{D}(P_Z\|Q)$ is valid. The second part is to show that the value of the limit is indeed
$E\left(\log \frac{P_Z(Z_0|Z_{-\infty}^{-1})}{Q(Z_0|Z_{-\infty}^{-1})}\right)$. Finally, the last part is to show the uniform convergence of the normalized log-likelihood ratio to
the relative entropy rate. The first two parts and the pointwise convergence of the third part are a generalization of
the Shannon-McMillan-Breiman theorem. The proof of these parts is identical to those in [HI Theorem 2.3.3] even
for the case where some parameters in $\Theta_k^\delta$ can be zero.

The uniform convergence in the third part of the lemma is crucial in that it enables us to obtain the second
consistency result (1), as in [5] Theorem 2.2.1]. We take into account our parameter set, and repeat the argument of
[5| Lemma 2.4.1]. To show the uniform convergence, we need to show

$$\lim_{n \to \infty} \frac{1}{n} \log Q(Z^n) = E\left(\log Q(Z_0|Z_{-\infty}^{-1})\right) \quad \text{a.s. uniformly on } \Theta_k^\delta.$$

Since the pointwise convergence can be shown and the parameter set $\Theta_k^\delta$ is compact, it is enough, by Ascoli's Theorem, to show that
$\frac{1}{n} \log Q(Z^n)$ is an equicontinuous sequence. That is, we need to show that $\forall \epsilon > 0$, $\exists \delta(\epsilon) > 0$ such
that

$$\forall n, \quad \left|\frac{1}{n} \log Q(Z^n) - \frac{1}{n} \log Q'(Z^n)\right| \leq \epsilon, \quad \text{if } \|Q - Q'\|_1 \leq \delta(\epsilon),$$



where $\|Q - Q'\|_1 = \sum_{i,j} |a_{ij} - a'_{ij}|$ is defined to be the $L_1$ distance between the two parameters defining $Q$ and $Q'$.
This equicontinuity can be proved by observing that the process $\{S_t = (X_{t-(k-1)}^{t}, Z_t)\}$ is a Markov process, where
$\{S_t\}$ has the state space $\mathcal{S} = \mathcal{A}^k \times \mathcal{A}$. This is true since

$$\begin{aligned}
Q(S_{t+1}|S^t) &= Q(X_{t+1}, Z_{t+1}|X^t, Z^t) \\
&= Q(X_{t+1}|X^t, Z^t)\, Q(Z_{t+1}|X^{t+1}, Z^t) \\
&= Q(X_{t+1}|X_{t-(k-1)}^{t})\, \Pi(X_{t+1}, Z_{t+1}) \\
&= Q(S_{t+1}|S_t).
\end{aligned}$$

Let $\{x_1^k(i) : i = 1, \ldots, M^k\}$ denote the set of all possible $k$-tuples of $\{X_t\}$, and let $s = (x_1^k(i), z)$, $\tilde{s} = (x_1^k(j), \tilde{z})$.
Then, the transition matrix $T$ of $\{S_t\}$ has elements $t_{s\tilde{s}} = Q(S_{t+1} = \tilde{s}|S_t = s) = a_{ij} \Pi(x_k(j), \tilde{z})$. Since all $A$ that are
in $\Theta_k^\delta$ are irreducible and aperiodic and $\Pi(x_k(j), \tilde{z}) > 0$, $\forall x_k(j), \tilde{z}$, $T$ is also irreducible and aperiodic. Hence, $T$ has
the unique stationary distribution $\tau$. Although there are zeros in $T$, by construction, any $n$-tuple $s^n$ has positive
probability. Since $\{S_t\}$ is also stationary, we have

$$Q(S^n = s^n) = \tau_{s_1} \prod_{t=k}^{n-1} t_{s_t s_{t+1}} = \tau_{s_1} \prod_{(s,\tilde{s})} t_{s\tilde{s}}^{\,n_{s\tilde{s}}},$$

where

$$n_{s\tilde{s}} = \sum_{t=k}^{n-1} \mathbf{1}(S_t = s, S_{t+1} = \tilde{s}).$$
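For another probability law $Q' \in \Theta_k^\delta$, we have

$$\begin{aligned}
\left|\frac{1}{n} \log Q(S^n) - \frac{1}{n} \log Q'(S^n)\right|
&\leq \left|\frac{1}{n} \log \tau_{s_1} - \frac{1}{n} \log \tau'_{s_1}\right| + \left|\frac{1}{n} \sum_{(s,\tilde{s})} n_{s\tilde{s}} \log t_{s\tilde{s}} - \frac{1}{n} \sum_{(s,\tilde{s})} n_{s\tilde{s}} \log t'_{s\tilde{s}}\right| \\
&\leq \left|\log \tau_{s_1} - \log \tau'_{s_1}\right| + \sum_{(s,\tilde{s})} \left|\log t_{s\tilde{s}} - \log t'_{s\tilde{s}}\right| \qquad (8) \\
&= \left|\log \tau_{s_1} - \log \tau'_{s_1}\right| + \sum_{(s,\tilde{s})} \left|\log a_{ij} - \log a'_{ij}\right|, \qquad (9)
\end{aligned}$$

where (8) is from the fact that $\frac{1}{n} \leq 1$ and $\frac{n_{s\tilde{s}}}{n} \leq 1$, and (9) is from the fact that the DMC, $\Pi$, is the same for $Q$ and $Q'$. The
summations are over the pairs that have nonzero transition probabilities.

Since the function $f(x) = \log x$ is uniformly continuous for $\delta \leq x \leq 1$, and $a_{ij} \geq \delta$ for all $a_{ij}$ that occur in the
summation, we have, for $\epsilon > 0$,

$$\sum_{(s,\tilde{s})} \left|\log a_{ij} - \log a'_{ij}\right| \leq \frac{\epsilon}{2} \quad \text{if } \|Q - Q'\|_1 \leq \delta_1(\epsilon).$$

Also, we know that all the elements of the stationary distribution of $T$ are bounded away from zero, since the largest
element of the stationary distribution of $T$ is lower bounded by $\frac{1}{M^{k+1}}$, and any state can be reached in a finite number
of steps whose transition probabilities are bounded away from zero. Therefore, for some $C_1 < \infty$, we have

$$\left|\log \tau_{s_1} - \log \tau'_{s_1}\right| \leq C_1 \left|\tau_{s_1} - \tau'_{s_1}\right|.$$

Then, from the result on the sensitivity of the stationary distribution of a Markov chain ^U], for some $C_2 < \infty$, we
have

$$\left|\tau_{s_1} - \tau'_{s_1}\right| \leq C_2 \sum_{(s,\tilde{s})} \left|t_{s\tilde{s}} - t'_{s\tilde{s}}\right|.$$

Hence, for $\epsilon > 0$, we obtain

$$\left|\log \tau_{s_1} - \log \tau'_{s_1}\right| \leq \frac{\epsilon}{2} \quad \text{if } \|Q - Q'\|_1 \leq \delta_2(\epsilon).$$

Therefore, by letting $\delta(\epsilon) = \min(\delta_1(\epsilon), \delta_2(\epsilon))$, we have

$$\left|\frac{1}{n} \log Q(S^n) - \frac{1}{n} \log Q'(S^n)\right| \leq \epsilon \quad \text{if } \|Q - Q'\|_1 \leq \delta(\epsilon).$$

Let us now go back to the original process $Z$. From

$$\left|\frac{1}{n} \log Q(S^n) - \frac{1}{n} \log Q'(S^n)\right| \leq \epsilon,$$

we have

$$Q'(X^n, Z^n) \leq \exp(n\epsilon)\, Q(X^n, Z^n),$$

thus,

$$Q'(Z^n) = \sum_{x^n} Q'(x^n, Z^n) \leq \exp(n\epsilon) \sum_{x^n} Q(x^n, Z^n) = \exp(n\epsilon)\, Q(Z^n),$$

where the summations are again over the sequences that have nonzero probabilities. By exchanging the roles of $Q$ and
$Q'$, we get the result that $\frac{1}{n} \log Q(Z^n)$ is an equicontinuous sequence. Therefore, we have the uniform convergence
of the lemma. ■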
Lemma 3 (Uniform Concentration) Suppose $Q \in \Theta_k^\delta$ for some fixed $\delta > 0$. Let $\hat{\mathbf{X}}^\epsilon_Q$ be the randomized filter defined
in (3). Then,

$$\lim_{n \to \infty} \left(L_{\hat{\mathbf{X}}^\epsilon_Q}(X^n, Z^n) - E\left(L_{\hat{\mathbf{X}}^\epsilon_Q}(X^n, Z^n)\right)\right) = 0 \quad \text{a.s. uniformly on } \Theta_k^\delta.$$

Proof: This lemma shows the uniform concentration property of $L_{\hat{\mathbf{X}}^\epsilon_Q}(X^n, Z^n)$. The randomization of the filter is
needed to deal with ties that occur in deciding the Bayes response. A detailed proof of this lemma is given in Appendix
2.

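Lemma 4 (Continuity) Consider a single-letter filtering setting. Suppose $Q$ is some other joint probability law of
$X$ and $Z$. Define single-letter filters $\hat{X}_P(z)$ and $\hat{X}^\epsilon_Q(z)$ as

$$\hat{X}_P(z)[\hat{x}] = \Pr\left(B(\mathbf{P}_{X|z}) = \hat{x}\right),$$
$$\hat{X}^\epsilon_Q(z)[\hat{x}] = \Pr\left(B(\mathbf{Q}_{X|z} + \mathbf{U}) = \hat{x}\right),$$

where $\mathbf{U} \in \mathbb{R}^M$ is a uniform random vector in $B_\epsilon$ as before. Then,

$$E\left(\ell(X, \hat{X}^\epsilon_Q(Z))\right) - E\left(\ell(X, \hat{X}_P(Z))\right) \leq \Lambda_{\max} K_\Pi \cdot \|\mathbf{P}_Z - \mathbf{Q}_Z\|_1 + C_\Lambda \cdot \epsilon,$$

where the expectations on the left-hand side of the inequality are under $P$, $K_\Pi = \sum_{i=1}^{M} \|\Pi_i^{-1}\|_2$, and
$C_\Lambda = \max_{a,b \in \mathcal{A}} \|\lambda_a - \lambda_b\|_2$.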

This lemma states that the excess expected loss of a randomized filter optimized for a mismatched probability law
can be upper bounded by the $L_1$ difference between the true and the mismatched probability laws of the output symbol,
plus a small constant term which diminishes with the randomization probability. This is somewhat analogous to a 
for the prediction which was derived in ^] (20)]. 

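Proof of Lemma 4: Define $\hat{X}_Q(z)[\hat{x}] = \Pr(B(\mathbf{Q}_{X|z}) = \hat{x})$. Then,

$$\begin{aligned}
&E\left(\ell(X, \hat{X}^\epsilon_Q(Z))\right) - E\left(\ell(X, \hat{X}_P(Z))\right) \\
&= \sum_{x,z} P(x,z) \left(\ell(x, \hat{X}^\epsilon_Q(z)) - \ell(x, \hat{X}_P(z))\right) \\
&\leq \sum_{x,z} \left(Q(x,z) + |P(x,z) - Q(x,z)|\right) \left(\ell(x, \hat{X}_Q(z)) - \ell(x, \hat{X}_P(z)) + \ell(x, \hat{X}^\epsilon_Q(z)) - \ell(x, \hat{X}_Q(z))\right) \\
&\leq \sum_{x,z} |P(x,z) - Q(x,z)| \cdot \left(\ell(x, \hat{X}_Q(z)) - \ell(x, \hat{X}_P(z))\right) \\
&\quad + \sum_{x,z} \left(Q(x,z) + |P(x,z) - Q(x,z)|\right) \cdot \left(\ell(x, \hat{X}^\epsilon_Q(z)) - \ell(x, \hat{X}_Q(z))\right) \qquad (10) \\
&= \sum_{x,z} |P(x,z) - Q(x,z)| \cdot \left(\ell(x, \hat{X}^\epsilon_Q(z)) - \ell(x, \hat{X}_P(z))\right) + \sum_{x,z} Q(x,z) \left(\ell(x, \hat{X}^\epsilon_Q(z)) - \ell(x, \hat{X}_Q(z))\right) \qquad (11) \\
&\leq \Lambda_{\max} \sum_{x,z} |P(x,z) - Q(x,z)| + \sum_{x,z} Q(x,z) \left(\ell(x, \hat{X}^\epsilon_Q(z)) - \ell(x, \hat{X}_Q(z))\right), \qquad (12)
\end{aligned}$$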
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where (|1(JH is from the fact that ^2 X Q(x, z)(£(x, Xq(z)) — l[x, Xp(z))) < and is from rearranging terms in 
the summation. Now, let's bound the first term in (|12l) . 

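\begin{align*}
\Lambda_{\max} \sum_{x,z} |P(x,z) - Q(x,z)| &= \Lambda_{\max} \sum_x |P(x) - Q(x)| \Big(\sum_z \Pi(x,z)\Big) \\
&= \Lambda_{\max} \sum_x |P(x) - Q(x)| && (13) \\
&\le \Lambda_{\max} K_\Pi \cdot \|P_Z - Q_Z\|_2 && (14) \\
&\le \Lambda_{\max} K_\Pi \cdot \|P_Z - Q_Z\|_1, && (15)
\end{align*}

where (13) is from the fact that $\sum_z \Pi(x,z) = 1$, (14) is from the Cauchy-Schwarz inequality, and (15) is from the fact 
that the $L_2$-norm is less than or equal to the $L_1$-norm. 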
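As a numerical sanity check of steps (13) to (15), the sketch below verifies $\sum_x |P(x) - Q(x)| \le K_\Pi \|P_Z - Q_Z\|_1$ on random input laws. It assumes $K_\Pi$ is the sum of the Euclidean norms of the rows of $\Pi^{-T}$ (which is what the Cauchy-Schwarz step requires, since $P_X = \Pi^{-T} P_Z$); the helper names and the 3-ary symmetric channel are illustrative assumptions, not taken from the paper.

```python
import math
import random

def invert(mat):
    """Gauss-Jordan inverse of a small square matrix given as a list of rows."""
    n = len(mat)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(mat)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def output_law(p_x, channel):
    """P_Z(z) = sum_x P_X(x) * channel[x][z]."""
    return [sum(p_x[x] * channel[x][z] for x in range(len(p_x)))
            for z in range(len(channel[0]))]

def k_pi(channel):
    """Sum of L2 norms of the rows of inverse-transpose (= columns of the inverse)."""
    inv = invert(channel)
    n = len(inv)
    return sum(math.sqrt(sum(inv[j][i] ** 2 for j in range(n))) for i in range(n))

def l1(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

rng = random.Random(1)

def random_pmf(n):
    w = [rng.random() for _ in range(n)]
    s = sum(w)
    return [x / s for x in w]

channel = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]
K = k_pi(channel)
gaps = []
for _ in range(1000):
    p, q = random_pmf(3), random_pmf(3)
    rhs = K * l1(output_law(p, channel), output_law(q, channel))
    gaps.append(rhs - l1(p, q))  # should never be negative
```

The gap is nonnegative for every draw, and it shrinks as the channel gets closer to singular only in the sense that $K_\Pi$ grows, which is why invertibility of $\Pi$ is essential to the bound.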
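The second term in (12) becomes 

\begin{align*}
& \sum_{x,z} Q(x,z) \big(\ell(x, \hat{X}_Q^\epsilon(z)) - \ell(x, \hat{X}_Q(z))\big) \\
&= \sum_z Q(z) \sum_x Q(x|z) \sum_{\hat{x}} \Lambda(x, \hat{x}) \big(\hat{X}_Q^\epsilon(z)[\hat{x}] - \hat{X}_Q(z)[\hat{x}]\big) \\
&= \sum_z Q(z) \sum_{\hat{x}} \sum_x \Lambda(x, \hat{x}) Q(x|z) \big(\hat{X}_Q^\epsilon(z)[\hat{x}] - \hat{X}_Q(z)[\hat{x}]\big) \\
&= \sum_z Q(z) \sum_{\hat{x}} \big(\hat{X}_Q^\epsilon(z)[\hat{x}] - \hat{X}_Q(z)[\hat{x}]\big) \cdot \lambda_{\hat{x}}^T Q_{X|z}. && (16)
\end{align*}

It is easy to see that the inner summation in (16) is always nonnegative since, by definition, $\hat{X}_Q(z)$ assigns probability 
1 to $B(Q_{X|z})$. Now, for a given $Q$, define 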



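\[ U_{\max} = \arg\max_{U \in B_\epsilon} \big(\lambda_{B(Q_{X|z} + U)} - \lambda_{B(Q_{X|z})}\big)^T Q_{X|z}, \qquad (17) \]
resolving ties arbitrarily. Then, we have, 

\begin{align*}
& \sum_{\hat{x}} \big(\hat{X}_Q^\epsilon(z)[\hat{x}] - \hat{X}_Q(z)[\hat{x}]\big) \cdot \lambda_{\hat{x}}^T Q_{X|z} \\
&= \Big(\sum_{\hat{x}} \hat{X}_Q^\epsilon(z)[\hat{x}] \cdot \lambda_{\hat{x}} - \lambda_{B(Q_{X|z})}\Big)^T Q_{X|z} \\
&\le \big(\lambda_{B(Q_{X|z} + U_{\max})} - \lambda_{B(Q_{X|z})}\big)^T Q_{X|z} && (18) \\
&\le \big(\lambda_{B(Q_{X|z})} - \lambda_{B(Q_{X|z} + U_{\max})}\big)^T U_{\max} && (19) \\
&\le \max_{a,b \in \mathcal{A}} \|\lambda_a - \lambda_b\|_2 \cdot \|U_{\max}\|_2 && (20) \\
&\le C_\Lambda \cdot \epsilon,
\end{align*}

where (18) follows from (17), (19) follows from the fact 

\[ \lambda_{B(Q_{X|z} + U_{\max})}^T (Q_{X|z} + U_{\max}) \le \lambda_{B(Q_{X|z})}^T (Q_{X|z} + U_{\max}), \]
and (20) follows from the Cauchy-Schwarz inequality. Note that depending on $Q$ and $z$, (18) and (19) can both be 
zero and hold with equality. Together with (15), the lemma is proved. ■ 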

Before moving on to Lemma 5, we need the following three definitions. We have seen earlier that for $Q \in \Theta_k^\delta$, 
$\mathbf{D}(P_Z \| Q)$ is well-defined. Now, let's consider the case where $Q \in \Theta_k^\delta$ is some function of the noisy observation $Z^n$ 
(denoted as $Q[Z^n]$). As mentioned in the footnote of Section 4-B, the notion of the relative entropy rate between $P_Z$ 
and that random $Q[Z^n]$ is defined in Definition 2 using Definition 1. Definition 3 is also needed for the inequality in 
Lemma 5. 

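Definition 1 Suppose $Q[Z^n] \in \Theta_k^\delta$, and $f$ is some function of $(X^\infty, Z^\infty, Q[Z^n])$ such that the expectation 

\[ E\big(f(X^\infty, Z^\infty, Q[Z^n])\big) = \int f(x^\infty, z^\infty, Q[z^n]) \, dP(x^\infty, z^\infty) \]
exists. Then, define the notation $\tilde{E}(\cdot)$ as follows: 

\[ \tilde{E}\big(f(X^\infty, Z^\infty, Q[Z^n])\big) \triangleq \int f(x^\infty, z^\infty, Q[Z^n]) \, dP(x^\infty, z^\infty). \]

That is, in $\tilde{E}\big(f(X^\infty, Z^\infty, Q[Z^n])\big)$, the Lebesgue integration with respect to the randomness of $Q[Z^n]$ is excluded. 

Definition 2 Suppose $Q[Z^n] \in \Theta_k^\delta$. Then, the relative entropy rate between $P_Z$ and $Q[Z^n]$ is defined as⁴ 

\[ \mathbf{D}(P_Z \| Q[Z^n]) \triangleq \tilde{E}\left(\log \frac{P_Z(Z_0 | Z_{-\infty}^{-1})}{Q[Z^n](Z_0 | Z_{-\infty}^{-1})}\right). \]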

Definition 3 Define the $k$-th order Markov approximation of $P_X$ for $n > k$ as 

\[ P_X^{(k)}(X^n) \triangleq P_X(X^k) \prod_{i=k+1}^{n} P_X(X_i | X_{i-k}^{i-1}). \]

Furthermore, denote $P_Z$ and $P_Z^{(k)}$ as the probability laws of the output of the DMC, $\Pi$, when the probability law of the input 
is $P_X$ and $P_X^{(k)}$, respectively.⁵ 
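The construction in Definition 3 can be illustrated on a finite block. The sketch below takes a joint pmf over length-$n$ tuples and forms $P_X(x^k)\prod_i P_X(x_i | x_{i-k}^{i-1})$ from its marginals. It is only an illustration: the function names are hypothetical, positions are 0-indexed, and the conditionals are computed per position (for a stationary source, as assumed in the paper, they do not depend on the position).

```python
from itertools import product

def marginal(pmf, positions):
    """Marginal pmf of the coordinates listed in `positions`."""
    out = {}
    for seq, p in pmf.items():
        key = tuple(seq[i] for i in positions)
        out[key] = out.get(key, 0.0) + p
    return out

def markov_approximation(pmf, k, alphabet):
    """k-th order Markov approximation of a pmf over length-n tuples."""
    n = len(next(iter(pmf)))
    head = marginal(pmf, range(k))
    joint = {i: marginal(pmf, range(i - k, i + 1)) for i in range(k, n)}
    past = {i: marginal(pmf, range(i - k, i)) for i in range(k, n)}
    approx = {}
    for seq in product(alphabet, repeat=n):
        prob = head.get(seq[:k], 0.0)
        for i in range(k, n):
            denom = past[i].get(seq[i - k:i], 0.0)
            numer = joint[i].get(seq[i - k:i + 1], 0.0)
            prob = prob * numer / denom if denom > 0 else 0.0
        approx[seq] = prob
    return approx

def chain_pmf(init, trans, n):
    """Joint pmf of the first n symbols of a first-order Markov chain."""
    pmf = {}
    for seq in product(range(len(init)), repeat=n):
        p = init[seq[0]]
        for a, b in zip(seq, seq[1:]):
            p *= trans[a][b]
        pmf[seq] = p
    return pmf

# Stationary binary chain: its first-order approximation reproduces it exactly.
source = chain_pmf([2 / 3, 1 / 3], [[0.9, 0.1], [0.2, 0.8]], n=4)
approx1 = markov_approximation(source, k=1, alphabet=range(2))
approx0 = markov_approximation(source, k=0, alphabet=range(2))  # i.i.d. fit
max_gap = max(abs(source[s] - approx1[s]) for s in source)
```

For a source that is itself Markov of order at most $k$, the approximation is exact; for $k = 0$ it reduces to the product of the single-letter marginals.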

Now, we give the following lemma, which upper bounds the relative entropy rate between $P_Z$ and the ML estimator. 
Lemma 5 For the given sequence $\{\delta_k\}$ defined in Section 4-A and for fixed $k$, we have 

\[ \lim_{n \to \infty} \mathbf{D}(P_Z \| \hat{Q}_{k,\delta_k}[Z^n]) \le \mathbf{D}(P_X \| P_X^{(k)}) \quad \text{a.s.} \]

Proof: Recall that $\hat{Q}_{k,\delta_k}[Z^n]$ is an ML estimator in $\Theta_k^{\delta_k}$ based on the observation $Z^n$. From [?], we know that 

\[ \lim_{n \to \infty} \mathbf{D}(P_Z \| \hat{Q}_{k,\delta_k}[Z^n]) = \min_{Q \in \Theta_k^{\delta_k}} \mathbf{D}(P_Z \| Q) \quad \text{a.s.} \]

Also, Assumption 1 and Definition 3 assure that $P_Z^{(k)} \in \Theta_k^{\delta_k}$. Therefore, we have 

\[ \lim_{n \to \infty} \mathbf{D}(P_Z \| \hat{Q}_{k,\delta_k}[Z^n]) \le \mathbf{D}(P_Z \| P_Z^{(k)}) \quad \text{a.s.} \]

⁴ Note that $\mathbf{D}(P_Z \| Q[Z^n])$ is a function of $Z^n$, and is still a random variable. 

⁵ Here, $P_Z^{(k)}$ is not the $k$-th order Markov approximation of $P_Z$, but is the distribution of the channel output whose input is $P_X^{(k)}$, the 
$k$-th order Markov approximation of the original input distribution $P_X$. 
This is the link where we needed Assumption 1. Now, let's denote $P^{(k)}$ as the joint probability law of $(X^n, Z^n)$ when 
the probability law of the input process is $P_X^{(k)}$. Then, by the chain rule of relative entropy [5, (2.67)], we have 

\begin{align*}
E\left(\log \frac{P(X^n, Z^n)}{P^{(k)}(X^n, Z^n)}\right) &= D_n(P_X \| P_X^{(k)}) + E\left(\log \frac{P(Z^n | X^n)}{P^{(k)}(Z^n | X^n)}\right) \\
&= D_n(P_Z \| P_Z^{(k)}) + E\left(\log \frac{P(X^n | Z^n)}{P^{(k)}(X^n | Z^n)}\right).
\end{align*}

Since the DMC is fixed, we have $E\left(\log \frac{P(Z^n | X^n)}{P^{(k)}(Z^n | X^n)}\right) = 0$. Moreover, by the nonnegativity of relative entropy, 
$E\left(\log \frac{P(X^n | Z^n)}{P^{(k)}(X^n | Z^n)}\right) \ge 0$. Therefore, we get $D_n(P_Z \| P_Z^{(k)}) \le D_n(P_X \| P_X^{(k)})$. Since $\mathbf{D}(P_X \| P_X^{(k)}) = \lim_{n \to \infty} \frac{1}{n} D_n(P_X \| P_X^{(k)})$ 
always exists by ergodicity, we have 

\[ \mathbf{D}(P_Z \| P_Z^{(k)}) \le \mathbf{D}(P_X \| P_X^{(k)}), \]

and the lemma is proved. ■ 
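The key step in the proof of Lemma 5 is that passing two input laws through the same fixed channel cannot increase their relative entropy. The following single-letter sketch checks this numerically; the channel matrix and helper names are illustrative assumptions, and the check is for one symbol rather than for the $n$-block quantities $D_n$ used in the proof.

```python
import math
import random

def kl_divergence(p, q):
    """Relative entropy D(p || q) in nats, for pmfs with q > 0 wherever p > 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def push_through(p_x, channel):
    """Output law of a memoryless channel: P_Z(z) = sum_x P_X(x) * channel[x][z]."""
    return [sum(p_x[x] * channel[x][z] for x in range(len(p_x)))
            for z in range(len(channel[0]))]

rng = random.Random(3)

def random_pmf(n):
    w = [rng.random() + 0.01 for _ in range(n)]  # offset keeps every mass positive
    s = sum(w)
    return [x / s for x in w]

channel = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.2, 0.6]]
results = []
for _ in range(1000):
    p, q = random_pmf(3), random_pmf(3)
    d_in = kl_divergence(p, q)
    d_out = kl_divergence(push_through(p, channel), push_through(q, channel))
    results.append((d_in, d_out))
```

Every pair satisfies `d_out <= d_in`, mirroring $D_n(P_Z \| P_Z^{(k)}) \le D_n(P_X \| P_X^{(k)})$.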

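Proof of Theorem 1: We are now finally in a position to prove our main theorem. As mentioned in Section 4-B, 
we first fix $k$ and $\epsilon$, and try to obtain an inequality of the following form to prove Part (a): 

\[ \limsup_{n \to \infty} \Big( L_{\hat{X}^{\epsilon}_{\mathrm{univ},k}}(X^n, Z^n) - \phi_n(P_X, \Pi) \Big) \le \rho\Big( \limsup_{t \to \infty} \mathbf{D}(P_Z \| \hat{Q}_k^t), \epsilon \Big). \]
From the definition of $L_{\hat{X}^{\epsilon}_{\mathrm{univ},k}}(X^n, Z^n)$, 

\[ L_{\hat{X}^{\epsilon}_{\mathrm{univ},k}}(X^n, Z^n) = \frac{1}{n} \sum_{t=1}^{n} \ell\big(X_t, \hat{X}^{\epsilon}_{\hat{Q}_k^t}(Z^t)\big), \]

where, from the definition of the filter, we know that $\hat{Q}_k^t$ is a function of $Z^{m_{i(t)}}$. Since $\ell\big(X_t, \hat{X}^{\epsilon}_{\hat{Q}_k^t}(Z^t)\big)$ is a function of $(X_t, Z^t, Q[Z^{m_{i(t)}}])$, 
we can define a quantity $\tilde{E}\big(\ell(X_t, \hat{X}^{\epsilon}_{\hat{Q}_k^t}(Z^t))\big)$ from Definition 1. From this, we also define 

\[ \tilde{E}\big(L_{\hat{X}^{\epsilon}_{\mathrm{univ},k}}(X^n, Z^n)\big) \triangleq \frac{1}{n} \sum_{t=1}^{n} \tilde{E}\big(\ell(X_t, \hat{X}^{\epsilon}_{\hat{Q}_k^t}(Z^t))\big). \]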

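Now, we have the following Corollary 1 from Lemma 3, whose proof is given in Appendix 3. This corollary is a key step 
in proving the main theorem, since it provides the crucial link that enables us to obtain the desired inequality. 

Corollary 1 For fixed $k$ and $\epsilon$, we have 

\[ \lim_{n \to \infty} \Big( L_{\hat{X}^{\epsilon}_{\mathrm{univ},k}}(X^n, Z^n) - \tilde{E}\big(L_{\hat{X}^{\epsilon}_{\mathrm{univ},k}}(X^n, Z^n)\big) \Big) = 0 \quad \text{a.s.} \]
From Corollary 1, we have the following equality: 

\[ \limsup_{n \to \infty} \Big( L_{\hat{X}^{\epsilon}_{\mathrm{univ},k}}(X^n, Z^n) - \phi_n(P_X, \Pi) \Big) = \limsup_{n \to \infty} \Big( \tilde{E}\big(L_{\hat{X}^{\epsilon}_{\mathrm{univ},k}}(X^n, Z^n)\big) - \phi_n(P_X, \Pi) \Big) \quad \text{a.s.} \]

Therefore, to get an inequality of the desired form, we can equivalently show 

\[ \limsup_{n \to \infty} \Big( \tilde{E}\big(L_{\hat{X}^{\epsilon}_{\mathrm{univ},k}}(X^n, Z^n)\big) - \phi_n(P_X, \Pi) \Big) \le \rho\Big( \limsup_{t \to \infty} \mathbf{D}(P_Z \| \hat{Q}_k^t), \epsilon \Big) \quad \text{a.s.} \]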



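Now, let's consider the following chain of inequalities: 

\begin{align*}
& \tilde{E}\big(L_{\hat{X}^{\epsilon}_{\mathrm{univ},k}}(X^n, Z^n)\big) - \phi_n(P_X, \Pi) \\
&= \frac{1}{n} \sum_{t=1}^{n} \tilde{E}\Big( E\big(\ell(X_t, \hat{X}^{\epsilon}_{\hat{Q}_k^t}(Z_t, Z^{t-1})) \,\big|\, Z^{t-1}\big) - E\big(\ell(X_t, \hat{X}_P(Z_t, Z^{t-1})) \,\big|\, Z^{t-1}\big) \Big) \\
&\le \frac{\Lambda_{\max} K_\Pi}{n} \sum_{t=1}^{n} \tilde{E}\, \big\| P_{Z_t | Z^{t-1}} - \hat{Q}^t_{k, Z_t | Z^{t-1}} \big\|_1 + C_\Lambda \cdot \epsilon \quad \text{a.s.} && (21) \\
&\le \frac{\sqrt{2 \ln 2}\, K_\Pi \Lambda_{\max}}{n} \sum_{t=1}^{n} \tilde{E}\left( \sqrt{ E\left( \log \frac{P(Z_t | Z^{t-1})}{\hat{Q}_k^t(Z_t | Z^{t-1})} \,\Big|\, Z^{t-1} \right) } \right) + C_\Lambda \cdot \epsilon \quad \text{a.s.} && (22) \\
&\le \sqrt{2 \ln 2}\, K_\Pi \Lambda_{\max} \sqrt{ \frac{1}{n} \sum_{t=1}^{n} \tilde{E}\left( \log \frac{P(Z_t | Z^{t-1})}{\hat{Q}_k^t(Z_t | Z^{t-1})} \right) } + C_\Lambda \cdot \epsilon \quad \text{a.s.} && (23)
\end{align*}

(21) is obtained from Lemma 4, since $\Pi$ does not vary with $t$ and, given $Z^{t-1}$, estimating $X_t$ based on $Z_t$ is equivalent 
to the single-letter setting of Lemma 4 with the corresponding conditional distribution. Also, (22) is from Pinsker's 
inequality, and (23) is from Jensen's inequality. By taking limsup on both sides, we have 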



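\[ \limsup_{n \to \infty} \Big( \tilde{E}\big(L_{\hat{X}^{\epsilon}_{\mathrm{univ},k}}(X^n, Z^n)\big) - \phi_n(P_X, \Pi) \Big) \le \sqrt{2 \ln 2}\, K_\Pi \Lambda_{\max} \sqrt{ \limsup_{n \to \infty} \frac{1}{n} \sum_{t=1}^{n} \tilde{E}\left( \log \frac{P(Z_t | Z^{t-1})}{\hat{Q}_k^t(Z_t | Z^{t-1})} \right) } + C_\Lambda \cdot \epsilon \quad \text{a.s.} \]

since the square root function is a continuous function. For the expression inside the square root on the right-hand 
side of the inequality, 

\begin{align*}
\limsup_{n \to \infty} \frac{1}{n} \sum_{t=1}^{n} \tilde{E}\left( \log \frac{P(Z_t | Z^{t-1})}{\hat{Q}_k^t(Z_t | Z^{t-1})} \right)
&\le \limsup_{t \to \infty} \tilde{E}\left( \log \frac{P(Z_t | Z^{t-1})}{\hat{Q}_k^t(Z_t | Z^{t-1})} \right) && (24) \\
&= \limsup_{t \to \infty} \tilde{E}\left( \log \frac{P(Z_0 | Z_{-\infty}^{-1})}{\hat{Q}_k^t(Z_0 | Z_{-\infty}^{-1})} \right) && (25) \\
&= \limsup_{t \to \infty} \mathbf{D}(P_Z \| \hat{Q}_k^t) \quad \text{a.s.} && (26)
\end{align*}

where (24) is from Cesaro's mean convergence theorem; (25) is from the fact that $P(Z_t | Z^{t-1}) \to P(Z_0 | Z_{-\infty}^{-1})$ almost 
surely by the martingale convergence theorem, and $\hat{Q}_k^t(Z_t | Z^{t-1}) \to \hat{Q}_k^t(Z_0 | Z_{-\infty}^{-1})$ almost surely by Lemma 1; and (26) is 
from Definition 2. Therefore, 

\begin{align*}
\limsup_{n \to \infty} \Big( L_{\hat{X}^{\epsilon}_{\mathrm{univ},k}}(X^n, Z^n) - \phi_n(P_X, \Pi) \Big)
&= \limsup_{n \to \infty} \Big( \tilde{E}\big(L_{\hat{X}^{\epsilon}_{\mathrm{univ},k}}(X^n, Z^n)\big) - \phi_n(P_X, \Pi) \Big) \\
&\le 2\sqrt{2 \ln 2}\, K_\Pi \Lambda_{\max} \sqrt{ \limsup_{t \to \infty} \mathbf{D}(P_Z \| \hat{Q}_k^t) } + C_\Lambda \cdot \epsilon \quad \text{a.s.} && (27)
\end{align*}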
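which finally is in the desired form. Now, we need to check that the right-hand side of (27) goes to zero if we let $k \to \infty$ 
and $\epsilon \downarrow 0$. To see this, consider the following further upper bounds. 

\begin{align*}
\limsup_{t \to \infty} \mathbf{D}(P_Z \| \hat{Q}_k^t) &= \limsup_{t \to \infty} \mathbf{D}(P_Z \| \hat{Q}_{k,\delta_k}[Z^t]) && (28) \\
&\le \mathbf{D}(P_X \| P_X^{(k)}), && (29)
\end{align*}

where (28) is from the fact that $m_{i(t)} \to \infty$ as $t \to \infty$, and (29) is from Lemma 5. The inequality (29) holds for 
every $k$, and by the Shannon-McMillan-Breiman Theorem, we know $\mathbf{D}(P_X \| P_X^{(k)}) \to 0$ as $k \to \infty$. Therefore, 

\[ \lim_{k \to \infty} \limsup_{t \to \infty} \mathbf{D}(P_Z \| \hat{Q}_k^t) = 0, \]

and thus, 

\[ \lim_{k \to \infty} \limsup_{n \to \infty} \Big( L_{\hat{X}^{\epsilon}_{\mathrm{univ},k}}(X^n, Z^n) - \phi_n(P_X, \Pi) \Big) \le C_\Lambda \cdot \epsilon \quad \text{a.s.} \]

Finally, sending $\epsilon$ to zero, Part (a) of the theorem is proved. Part (b) follows directly from (a) and the reverse Fatou 
lemma. That is, 

\begin{align*}
\lim_{k \to \infty} \limsup_{n \to \infty} \Big( E\big(L_{\hat{X}^{\epsilon}_{\mathrm{univ},k}}(X^n, Z^n)\big) - \phi_n(P_X, \Pi) \Big)
&= \lim_{k \to \infty} \limsup_{n \to \infty} E\Big( L_{\hat{X}^{\epsilon}_{\mathrm{univ},k}}(X^n, Z^n) - \phi_n(P_X, \Pi) \Big) \\
&\le \lim_{k \to \infty} E\Big( \limsup_{n \to \infty} \big( L_{\hat{X}^{\epsilon}_{\mathrm{univ},k}}(X^n, Z^n) - \phi_n(P_X, \Pi) \big) \Big) \\
&\le C_\Lambda \cdot \epsilon.
\end{align*}

Note that the expectation here is also with respect to the randomness of the probability law within the parentheses. By 
sending $\epsilon$ to zero, Part (b) is proved. ■ 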
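Steps (22) and (23) in the proof above rely on two standard inequalities: Pinsker's inequality with relative entropy measured in bits, $\|P - Q\|_1 \le \sqrt{2 \ln 2 \cdot D(P \| Q)}$, and Jensen's inequality for the concave square root. The sketch below checks both numerically on random data; it is only a sanity check under these stated forms, and the helper names are hypothetical.

```python
import math
import random

def kl_bits(p, q):
    """Relative entropy D(p || q) in bits."""
    return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)

def l1(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

rng = random.Random(5)

def random_pmf(n):
    w = [rng.random() + 0.01 for _ in range(n)]
    s = sum(w)
    return [x / s for x in w]

# Pinsker: L1 distance is at most sqrt(2 ln2 * D) when D is in bits.
pinsker_ok = True
for _ in range(1000):
    p, q = random_pmf(4), random_pmf(4)
    bound = math.sqrt(2 * math.log(2) * max(0.0, kl_bits(p, q)))
    pinsker_ok = pinsker_ok and l1(p, q) <= bound + 1e-12

# Jensen: the average of square roots is at most the square root of the average.
values = [rng.random() for _ in range(50)]
mean_of_sqrt = sum(math.sqrt(v) for v in values) / len(values)
sqrt_of_mean = math.sqrt(sum(values) / len(values))
jensen_ok = mean_of_sqrt <= sqrt_of_mean + 1e-12
```

Both flags come out true, which is why the per-symbol $L_1$ terms in (21) can be traded for a single square root of an averaged log-likelihood ratio in (23).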

5 Extension: Universal filtering for channel with memory 

Now, let's extend our result to the case where the channel has memory. With the identical assumption on $\{X_t\}$, now 
suppose $\{Z_t\}$ is expressed as 

\[ Z_t = X_t \oplus N_t \qquad (30) \]

where $\oplus$ denotes modulo-$M$ addition, and $\{N_t\}$ is an $\mathcal{A}$-valued noise process which is not necessarily memoryless. 
We assume we have complete knowledge of the probability law of $\{N_t\}$. Specifically, let's consider the case where 
$\{N_t\}$ is an FS-HMP, that is, it is the output of an invertible memoryless channel $\Gamma = \{\Gamma(i,j)\}_{i,j \in \mathcal{A}}$ whose input is an 
irreducible, aperiodic $\ell$-th order Markov chain, $\{S_t\}$, which is independent of $\{X_t\}$. Let $\Gamma_{\min} = \min_{i,j \in \mathcal{A}} \{\Gamma(i,j)\}$, 
and suppose $\Gamma_{\min} > 0$. For simplicity, assume that the alphabet size of $\{S_t\}$ is also $M$. 

In this model, the channel between $X_t$ and $Z_t$ at time $t$ is an $M$-ary symmetric channel, which is specified by the 
$S_t$-th row of $\Gamma$. Let's define an $M \times M$ matrix $\Pi_t$ whose $(x_t, z_t)$-th element is 

\begin{align*}
\Pi_t(x_t, z_t) &= P_{N_t}(z_t \ominus x_t) \\
&= \Pr(Z_t = z_t | X_t = x_t) \\
&= \sum_{s_t} \Pr(Z_t = z_t | X_t = x_t, S_t = s_t) \Pr(S_t = s_t),
\end{align*}
where $\ominus$ denotes modulo-$M$ subtraction. Now, let's make the following assumptions on the noise process. 

• $\{N_t\}$ is stationary, i.e., $\Pi_t$ is identical for $\forall t$ 

• $\Pi_t$ is invertible 

• $\exists \alpha$ such that $\Pr(S_t | S_{t-\ell}^{t-1}) > \alpha > 0$, for $\forall S_{t-\ell}^{t}$ 

As stated in [22, 2-A], the first and the second assumptions are rather benign. In particular, for the second assumption, 
it can be shown that under benign conditions on the parametrization, almost all parameter values, except for those 
in a set of Lebesgue measure zero, give rise to a process satisfying this assumption. Also, since this only corresponds 
to the case $k = 0$ in [22, Assumption 1], it is a much weaker assumption. The third assumption is a 
positivity assumption similar to Assumption 1, which enables our universal filtering scheme. 

Under these assumptions on the noise process, we can extend our scheme to do universal filtering for this 
channel. First, we can convert this channel to the equivalent memoryless channel, $\Xi = \{\Xi((i,j),h)\}_{i,j,h \in \mathcal{A}}$, where 
the input process is $\{(X_t, S_t)\}$ and the output is $\{Z_t\}$. Here, $\Xi$ is an $M^2 \times M$ matrix, and the channel transition 
probability is 

\[ \Xi((i,j),h) = \Gamma(j, h \ominus i) \quad \forall i,j,h. \]
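The conversion to an equivalent memoryless channel can be sketched as follows, assuming the relation $\Xi((i,j),h) = \Gamma(j, h \ominus i)$ and the averaged single-letter channel $\Pi(x,z) = \sum_s \Gamma(s, z \ominus x)\Pr(S = s)$. The row ordering $(i,j) \mapsto iM + j$, the function names, and the example matrix are assumptions made for this illustration only.

```python
def equivalent_channel(gamma):
    """Build the M^2 x M matrix Xi with Xi[(i, j)][h] = gamma[j][(h - i) mod M].

    Row (i, j) is stored at index i * M + j (an ordering chosen for this sketch).
    """
    m = len(gamma)
    return [[gamma[j][(h - i) % m] for h in range(m)]
            for i in range(m) for j in range(m)]

def averaged_channel(gamma, state_law):
    """Pi[x][z] = sum_s gamma[s][(z - x) mod M] * Pr(S = s)."""
    m = len(gamma)
    return [[sum(state_law[s] * gamma[s][(z - x) % m] for s in range(m))
             for z in range(m)] for x in range(m)]

# Illustrative noise-emission matrix: row s is the law of N_t given S_t = s.
gamma = [[0.8, 0.1, 0.1],
         [0.6, 0.3, 0.1],
         [0.5, 0.2, 0.3]]
M = len(gamma)
xi = equivalent_channel(gamma)
pi = averaged_channel(gamma, state_law=[0.5, 0.3, 0.2])
```

Each row of `xi` is a cyclic shift of a row of `gamma`, so the pair $(X_t, S_t)$ sees a memoryless channel even though the noise itself has memory.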

To do the filtering, we apply our scheme to this equivalent memoryless channel. For fixed $k > \ell$, as in Section 2-B.1, 
define a parameter set of HMPs, $\Theta_k$, whose Markov chain has $M^{k+\ell}$ states, and whose memoryless channel is $\Xi$. The 
$k$-th order conditional probability of our new input process is 

\begin{align*}
\Pr(X_t, S_t | X_{t-k}^{t-1}, S_{t-k}^{t-1}) &= \Pr(X_t | X_{t-k}^{t-1}) \cdot \Pr(S_t | S_{t-\ell}^{t-1}) \\
&> \delta_k \cdot \alpha, && (31)
\end{align*}

where (31) is from Assumption 1 and the third condition on the noise process. Let $\gamma_k = \delta_k \cdot \alpha$. Then, we can 
model $\{Z_t\}$ in $\Theta_k^{\gamma_k}$, or equivalently, model $(X_t, S_t)$ as a $k$-th order Markov chain, and obtain $\hat{Q}_k^t$, the ML estimator in 
$\Theta_k^{\gamma_k}$ based on $Z^{m_{i(t)}}$. By forward recursion, we can get $\hat{Q}_k^t(X_t, S_t | Z^t)$, and by summing over $S_t$, we can calculate 
$\hat{Q}_k^t(X_t | Z^t)$. Then, finally, we define our sequence of universal filtering schemes 
exactly as we proposed in Section 4-A. 
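The forward recursion mentioned above can be sketched for the simplest case. The code below is a minimal illustration for a first-order hidden Markov chain observed through a memoryless channel; the paper's scheme uses a $k$-th order chain on the pair $(X_t, S_t)$ with ML-estimated parameters, which this sketch does not attempt to reproduce. All names and numbers are hypothetical.

```python
def forward_filter(init, trans, channel, observations):
    """Return the filtering posteriors Q(X_t = x | z^t) for t = 1..len(observations).

    init[x]       : law of X_1
    trans[i][j]   : Pr(X_{t+1} = j | X_t = i)
    channel[x][z] : Pr(Z_t = z | X_t = x)
    """
    posteriors = []
    predictive = list(init)  # law of X_t given z^{t-1}
    n_states = len(init)
    for z in observations:
        # Measurement update: weight the prediction by the channel likelihood.
        weighted = [predictive[x] * channel[x][z] for x in range(n_states)]
        total = sum(weighted)
        posterior = [w / total for w in weighted]
        posteriors.append(posterior)
        # Time update: propagate the posterior one step through the chain.
        predictive = [sum(posterior[i] * trans[i][j] for i in range(n_states))
                      for j in range(n_states)]
    return posteriors

# Sticky binary chain observed through a binary symmetric channel.
trans = [[0.9, 0.1], [0.1, 0.9]]
bsc = [[0.9, 0.1], [0.1, 0.9]]
posts = forward_filter([0.5, 0.5], trans, bsc, observations=[0, 0, 0])
```

With joint states $(X_t, S_t)$ in place of $X_t$, the filtering law of $X_t$ alone is obtained by summing each posterior over the $S_t$ coordinate, as described in the text.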

The analysis of this scheme is identical to the one given in the proof of the main theorem. Inequality (21), which is the only 
place where the invertibility of $\Pi$ is used, can also be obtained in this case due to the second assumption on the 
noise process. Thus, we again get 

\[ \limsup_{n \to \infty} \Big( L_{\hat{X}^{\epsilon}_{\mathrm{univ},k}}(X^n, Z^n) - \phi_n(P_X, \Pi) \Big) \le 2\sqrt{2 \ln 2}\, K_\Pi \Lambda_{\max} \sqrt{ \limsup_{t \to \infty} \mathbf{D}(P_Z \| \hat{Q}_k^t) } + C_\Lambda \cdot \epsilon \quad \text{a.s.} \]

Since 

\[ \limsup_{t \to \infty} \mathbf{D}(P_Z \| \hat{Q}_k^t) = \limsup_{t \to \infty} \mathbf{D}(P_Z \| \hat{Q}_{k,\gamma_k}[Z^t]) \le \mathbf{D}(P_X \| P_X^{(k)}) \]

by the same argument as in Lemma 5, we have the same result as Theorem 1. Thus, we can successfully extend our 
scheme to the case where the channel noise is an FS-HMP, under some mild assumptions. 
6 Conclusion and future work 

In this paper, we proved that, for a known, invertible DMC, a family of filters based on HMPs is universally 
asymptotically optimal for any stationary and ergodic $\{X_t\}$ satisfying a mild positivity condition. That 
is, we showed that our sequence of schemes indexed by $k$ and $\epsilon$ asymptotically achieves the optimum filtering performance 
regardless of the clean source distribution. We also extended this scheme to the case where the channel has memory, 
specifically where the channel noise process is an FS-HMP. One direction for future work is to ascertain the 
relationship between $k$ and $n$, so that we can devise a single scheme that grows $k$ at some rate related to 
$n$. Loosening the positivity assumption made in our main theorem and extending our discrete 
universal filtering schemes to discrete universal denoising schemes are additional directions for future research. 
Appendix 1 

Here, we revise three lemmas from [?] regarding the probability law of an HMP. These are needed for the proofs above. For 
the following three lemmas, fix $k$ and $\delta$, and suppose $Q \in \Theta_k^\delta$. Also, fix some $m \in \mathbb{N}$ such that $m > k$. Proofs are 
similar to [?, Appendix]. Note that $\{X_t\}$ is still our clean signal and $\{Z_t\}$ is the noisy observed signal (not necessarily 
an HMP). 

Lemma 6 We have 

where M<5,m,fc = (1 + (s-nl~)™+i* is independent of Q, Zf^i, j. 
Proof: 

Q(X t+m =j\X t = i,Z? 00 ) 

Q(X t+m =j'\X t = i,Z™ 00 ) 
_ Q(X t+m =j,Z? O0 \X t =i) 

Q(X t+m =j',Z2° 00 \X t =i) 
_ Q(X t+m =j,Z~ +m+k+1 \X t = i) Q(Z t +™ +k \X t = 2 1 X t+m =j) 

Q(X t+m =j',Z™ m+k+1 \X t = i) Q(Z t +™+ k \X t = i,X t+m =f) 

Now, let's bound the terms in (32). First, 

Q(X t +m = j, Z^_ m+k+l \X t = i) 

Q(X t+m = f, Z^ m+k+1 \X t = i) 
_ 12 j„ Q( x t+ m +k = jo, X t+m = j, Z^_ m+k+1 \X t = i) 

Sjo Q( x t+m+k = Jo, x t+m = i' i z ^ m+k+1 \X t = i) 
_ J2j a ija'jj Q(Z^ m+k+1 \X t+m+k = jo) 
~E~ a™,a k .,. o Q(Z°° m+k+1 \X t+m+k = j ) ' 
Note that $a^m_{ij} > \delta^m$ and $a^k_{j j_0} > \delta^k$, $\forall i, j, j_0$, from the assumption of $\Theta_k^\delta$. Let $Q(Z_{t+m+k+1}^{\infty} | X_{t+m+k} = j_0) = \alpha_{j_0}$. Then, 
the last expression is 



°^3 ^jo Q 3jo a Jo 

a m ., V ■ a fc , a 

ij ^3o j jo ■ 



(33) 



Jo 



Since 



we have 



Ejo a jjo a 3° 



™, . jo V a*-, . / 



T\- a 9n a", . 



J 30 



^<^I max f4^ N ) < max f-^S-W-ljr. 

~ a m , io V a fc , . / ~ i,j,j',j \a m .,a k , . / ~ <5 m + fc 
y j 30 y j jo 



(34) 



Now let's look at the second term in (32). That is, 

Q{ZlX? +k \X t =i 1 X t+m = ] ) 



Q(Zt+r +k \Xt = i,X t+m =f) 

7t-\-m-\-k 



<- 



E XT Q{zl+? + *\x t = i,X t+m = j,X T = x T ) ■ Q(X T = x T \X t = i,X t+m = j) 
E* T Q{ZltT +k \Xt = i,X t+m = f,X T = x T ) ■ Q(X T = x T \X t = i,X t+m = j) 

1 



(35) 



CTT • \ m + k 

where T = {t + I,- ■ ■ ,t + m+ k}\{t,t+m}. Thus, from (£2} and 

Let now pj = Q{X t+m = j\X t = i,^^), then 1 = pj + Y,j'jtjPj' < Pi + ( M - 1) (s-nJ!i)™+>" and thus ' ft' - 
(1 + rg.jj ■ . ~ym+h which proves the lemma. 

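Lemma 7 Consider the following two arbitrarily given sets. 

\[ C_t \in \mathcal{X}_t^{\infty} \triangleq \{x_\tau : \tau \subset \mathbb{Z}_{\ge t} \cup \{\infty\}\} \quad \text{and} \quad D \in \{z_\tau : \tau \subset \mathbb{Z} \cup \{\infty, -\infty\}\}. \]

For $d \in \mathbb{N}$, define 

\[ M_d^{+} \triangleq \max_i Q(C_t | X_{t-dm} = i, D), \]
\[ M_d^{-} \triangleq \min_i Q(C_t | X_{t-dm} = i, D). \]

Then, 

\[ M_d^{+} - M_d^{-} \le (\rho_{\delta,k,m})^{d-1}, \]

where $\rho_{\delta,k,m} = 1 - 2\mu_{\delta,k,m}$. 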

Proof: From the argument of Lemma 6, it is easy to see that 

\[ Q(X_{t+m} = j | X_t = i, D) \ge \mu_{\delta,k,m}, \]
independent of $D$, too. Now, define 

\begin{align*}
\gamma_i(d) &= Q(C_t | X_{t-dm} = i, D) \\
p_{ij}(d) &= Q(X_{t-dm} = j | X_{t-(d+1)m} = i, D) \\
i^{+}(d) &= \arg\max_i Q(C_t | X_{t-(d+1)m} = i, D) \\
i^{-}(d) &= \arg\min_i Q(C_t | X_{t-dm} = i, D).
\end{align*}

Since $\delta$, $k$ and $m$ are fixed, let's simply denote $\mu = \mu_{\delta,k,m}$. Also, let's omit $d$ and the parentheses for the above four 
quantities to simplify notation. Then, 

\begin{align*}
M_{d+1}^{+} &= Q(C_t | X_{t-(d+1)m} = i^{+}, D) = \sum_j \gamma_j p_{i^{+} j} \\
&= \mu M_d^{-} + (p_{i^{+} i^{-}} - \mu) M_d^{-} + \sum_{j \ne i^{-}} \gamma_j p_{i^{+} j} && (36) \\
&\le \mu M_d^{-} + (1 - \mu) M_d^{+}, && (37)
\end{align*}

where (36) is possible from Lemma 6 since $p_{ij} \ge \mu$ for $\forall i, j$. 
By a similar argument, we get 

\[ M_{d+1}^{-} \ge \mu M_d^{+} + (1 - \mu) M_d^{-}. \qquad (38) \]

By subtracting (38) from (37), we get 

\[ M_{d+1}^{+} - M_{d+1}^{-} \le (1 - 2\mu)(M_d^{+} - M_d^{-}) \le \cdots \le (1 - 2\mu)^d, \]

which proves the lemma. Note that $\mu = \mu_{\delta,k,m} < \frac{1}{2}$ and thus $0 < \rho_{\delta,k,m} < 1$. Also, the result does not 
depend on $Q$. 
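The contraction in Lemma 7 can be observed numerically. The sketch below repeatedly averages a vector with random row-stochastic kernels whose entries are all at least $\mu$ and checks that the spread (maximum minus minimum) after $d$ steps is at most $(1 - 2\mu)^d$ times the initial spread. The kernels here are random stand-ins for the conditional laws $p_{ij}(d)$ of the proof; the construction and names are assumptions for illustration.

```python
import random

def random_kernel(m, mu, rng):
    """Row-stochastic m x m matrix with every entry >= mu (requires m * mu <= 1)."""
    rows = []
    for _ in range(m):
        w = [rng.random() for _ in range(m)]
        s = sum(w)
        rows.append([mu + (1.0 - m * mu) * x / s for x in w])
    return rows

def step(kernel, values):
    """One averaging step: new[i] = sum_j kernel[i][j] * values[j]."""
    return [sum(kernel[i][j] * values[j] for j in range(len(values)))
            for i in range(len(kernel))]

def spread(values):
    return max(values) - min(values)

rng = random.Random(7)
m, mu = 3, 0.1
values = [rng.random() for _ in range(m)]  # plays the role of gamma_i(d)
spreads = [spread(values)]
for _ in range(10):
    values = step(random_kernel(m, mu, rng), values)
    spreads.append(spread(values))
```

The geometric decay of `spreads` is the mechanism behind the exponential forgetting of the initial state that Lemma 8 and Lemma 9(c) build on.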



Lemma 8 

-(d+i) 



for Vp, Vd > 1, and < Z < m - 1. 
Proof: 

Q{ C t\ Z t-(d+l) m -l) 

= ^2Q(C t \Z^ {d+1)m _ v X t _ {d+2 ) m = j)Q(Xt-(d+2)m = j\ Z f-(d+l)m-l) 
j 

and therefore, 

M d+2 < QiCtlZl^.^) < M+ +2 

On the other hand, 

Q(Ct\zi dm ^) 

_ \ A C\(H I 7 p \n( yt-dm-l-l _ t—dm—l—l \ yP \ 

~ ^\^ t \^t-{d+\)m-l)^\^ J t-{d+X)m-l ~ Z t-{d+l)m-l\ Zj t-dm-l) 

t — dm—l — l 
t—{d+l)rn-l 
and thus, 

Therefore, from Lemma 7, we have 

\[ \big| Q(C_t | z^{p}_{t-dm-l}) - Q(C_t | z^{p}_{t-(d+1)m-l}) \big| \le M_{d+2}^{+} - M_{d+2}^{-} \le (\rho_{\delta,k,m})^{d+1}. \]

Note that the result does not depend on either $Q$ or $l$. 

Appendix 2 

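Before proving Lemma 3, we need the following lemma first. Parts (b), (c), and (d) are crucial for Lemma 3, and Part (a) 
enables Part (b). Part (a) is the reason why we need a randomization of the filter. 

Lemma 9 Suppose $Q \in \Theta_k^\delta$ and fix $\epsilon > 0$. 

(a) We have 

\[ \big\| \hat{X}_Q^\epsilon(z^0_{-t_1}) - \hat{X}_Q^\epsilon(z^0_{-t_2}) \big\|_1 \le M^2 \cdot \big\| Q_{X_0 | z^0_{-t_1}} - Q_{X_0 | z^0_{-t_2}} \big\|_1, \]

where $t_1, t_2 > 0$ are arbitrary integers. That is, for any integer $t > 0$ and any individual sequence $z^0_{-t}$, $\hat{X}_Q^\epsilon(z^0_{-t})$ 
is a Lipschitz continuous function of $Q_{X_0 | z^0_{-t}}$. 

(b) $\ell\big(X_0, \hat{X}_Q^\epsilon(Z^0_{-t})\big) \to \ell\big(X_0, \hat{X}_Q^\epsilon(Z^0_{-\infty})\big)$ a.s. uniformly on $\Theta_k^\delta$. 

(c) For $\forall Q \in \Theta_k^\delta$ and $\forall \omega$, $\exists\, 0 < \gamma < 1$, $\beta > 0$, such that $|Q(X_0 | Z^0_{-t}) - Q(X_0 | Z^0_{-\infty})| \le \beta \gamma^t$. 

(d) For fixed $t, \eta > 0$, $\exists$ some finite set $\mathcal{F}_k(t, \eta) \subset \Theta_k^\delta$, such that 

\[ \max_{Q \in \Theta_k^\delta} \; \min_{Q' \in \mathcal{F}_k(t,\eta)} \; \max_{x_0, z^0_{-t}} \big| Q(x_0 | z^0_{-t}) - Q'(x_0 | z^0_{-t}) \big| \le \eta. \]

Proof: 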

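(a) For a given simplex vector $Q$, fixed $\hat{x}$, and $B_\epsilon$ defined as in Section 4-A, we define the following. 

• $S^{\epsilon}_{\hat{x}}(Q) \triangleq \{W \in B_\epsilon : B(Q + W) = \hat{x}\}$ 

• $DP(\hat{x}) \triangleq \{c^T y = 0 : y \in \mathbb{R}^M, c = \lambda_{\hat{x}} - \lambda_a, \forall a \in \mathcal{A} \setminus \{\hat{x}\}\}$ 

• $\mathrm{dist}(Q, c^T y = 0) \triangleq$ the shortest $L_2$ distance from a simplex vector $Q$ to the plane $c^T y = 0$ 

That is, $S^{\epsilon}_{\hat{x}}(Q)$ is the set of vectors in the $\epsilon$-ball, $B_\epsilon$, that make the Bayes response $B(Q + W)$ equal to $\hat{x}$. Also, 
$DP(\hat{x})$ is the set of decision planes that separate the decision region for the reconstruction symbol $\hat{x}$ from those of the other 
symbols. Then, for some fixed $t$, by definition, 

\[ \hat{X}_Q^\epsilon(z^0_{-t})[\hat{x}] = \frac{\mathrm{Vol}\big(S^{\epsilon}_{\hat{x}}(Q_{X_0 | z^0_{-t}})\big)}{\mathrm{Vol}(B_\epsilon)}, \]

where $\mathrm{Vol}(\cdot)$ is the volume of a set. Since $\mathrm{Vol}(B_\epsilon)$ is a constant, for any $t_1$ and $t_2$, we have 

\[ \big| \hat{X}_Q^\epsilon(z^0_{-t_1})[\hat{x}] - \hat{X}_Q^\epsilon(z^0_{-t_2})[\hat{x}] \big| = \frac{\big| \mathrm{Vol}\big(S^{\epsilon}_{\hat{x}}(Q_{X_0 | z^0_{-t_1}})\big) - \mathrm{Vol}\big(S^{\epsilon}_{\hat{x}}(Q_{X_0 | z^0_{-t_2}})\big) \big|}{\mathrm{Vol}(B_\epsilon)}. \qquad (39) \]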
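For the numerator, as a crude bound, we get 

\begin{align*}
& \big| \mathrm{Vol}\big(S^{\epsilon}_{\hat{x}}(Q_{X_0 | z^0_{-t_1}})\big) - \mathrm{Vol}\big(S^{\epsilon}_{\hat{x}}(Q_{X_0 | z^0_{-t_2}})\big) \big| \\
&\le \mathrm{Vol}(B_\epsilon^{M-1}) \cdot \sum_{c^T y = 0 \,\in\, DP(\hat{x})} \big| \mathrm{dist}(Q_{X_0 | z^0_{-t_1}}, c^T y = 0) - \mathrm{dist}(Q_{X_0 | z^0_{-t_2}}, c^T y = 0) \big|, && (40)
\end{align*}

where $B_\epsilon^{M-1} = \{U \in \mathbb{R}^{M-1} : \|U\|_2 \le \epsilon\}$. Since 

\[ \mathrm{dist}(Q, c^T y = 0) = \frac{|c^T Q|}{\|c\|_2}, \]

we have 

\begin{align*}
\big| \mathrm{dist}(Q_{X_0 | z^0_{-t_1}}, c^T y = 0) - \mathrm{dist}(Q_{X_0 | z^0_{-t_2}}, c^T y = 0) \big|
&= \frac{\big| |c^T Q_{X_0 | z^0_{-t_1}}| - |c^T Q_{X_0 | z^0_{-t_2}}| \big|}{\|c\|_2} \\
&\le \frac{\big| c^T (Q_{X_0 | z^0_{-t_1}} - Q_{X_0 | z^0_{-t_2}}) \big|}{\|c\|_2} && (41) \\
&\le \big\| Q_{X_0 | z^0_{-t_1}} - Q_{X_0 | z^0_{-t_2}} \big\|_2 && (42) \\
&\le \big\| Q_{X_0 | z^0_{-t_1}} - Q_{X_0 | z^0_{-t_2}} \big\|_1, && (43)
\end{align*}

where (41) is from the triangle inequality, (42) is from the Cauchy-Schwarz inequality, and (43) is from the fact 
that the $L_2$-norm is less than or equal to the $L_1$-norm. Therefore, (40) becomes 

\[ \big| \mathrm{Vol}\big(S^{\epsilon}_{\hat{x}}(Q_{X_0 | z^0_{-t_1}})\big) - \mathrm{Vol}\big(S^{\epsilon}_{\hat{x}}(Q_{X_0 | z^0_{-t_2}})\big) \big| \le M \cdot \mathrm{Vol}(B_\epsilon^{M-1}) \cdot \big\| Q_{X_0 | z^0_{-t_1}} - Q_{X_0 | z^0_{-t_2}} \big\|_1, \]

and thus, (39) becomes 

\begin{align*}
\big| \hat{X}_Q^\epsilon(z^0_{-t_1})[\hat{x}] - \hat{X}_Q^\epsilon(z^0_{-t_2})[\hat{x}] \big|
&\le M \cdot \frac{\mathrm{Vol}(B_\epsilon^{M-1})}{\mathrm{Vol}(B_\epsilon)} \cdot \big\| Q_{X_0 | z^0_{-t_1}} - Q_{X_0 | z^0_{-t_2}} \big\|_1 \\
&\le M \cdot \big\| Q_{X_0 | z^0_{-t_1}} - Q_{X_0 | z^0_{-t_2}} \big\|_1.
\end{align*}

Therefore, we have 

\[ \big\| \hat{X}_Q^\epsilon(z^0_{-t_1}) - \hat{X}_Q^\epsilon(z^0_{-t_2}) \big\|_1 \le M^2 \cdot \big\| Q_{X_0 | z^0_{-t_1}} - Q_{X_0 | z^0_{-t_2}} \big\|_1, \]

and Part (a) is proved. 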

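(b) By exactly the same argument as in proving Lemma 1, we can easily see that $Q(X_0 | Z^0_{-t}) \to Q(X_0 | Z^0_{-\infty})$ for 
$\forall \omega$, uniformly in $\forall Q \in \Theta_k^\delta$. Since we have 

\begin{align*}
\big| \ell\big(X_0, \hat{X}_Q^\epsilon(Z^0_{-t})\big) - \ell\big(X_0, \hat{X}_Q^\epsilon(Z^0_{-\infty})\big) \big|
&= \Big| \sum_{\hat{x}} \Lambda(X_0, \hat{x}) \big( \hat{X}_Q^\epsilon(Z^0_{-t})[\hat{x}] - \hat{X}_Q^\epsilon(Z^0_{-\infty})[\hat{x}] \big) \Big| \\
&\le \Lambda_{\max} \big\| \hat{X}_Q^\epsilon(Z^0_{-t}) - \hat{X}_Q^\epsilon(Z^0_{-\infty}) \big\|_1 \\
&\le \Lambda_{\max} M^2 \cdot \big\| Q(X_0 | Z^0_{-t}) - Q(X_0 | Z^0_{-\infty}) \big\|_1,
\end{align*}

we get the uniform convergence. 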
(c) Again, let's follow the argument in the proof of Lemma 1. Suppose $t = jk + l$, where $j = \lfloor t/k \rfloor$ and $l = t 
\bmod k$. Then, 



|Q(X |Z°t)-Q(Xo|Z°oo)l 
=|Q(X |Z° ifc _ i )-Q(Xo|Z° 00 )| 

oo 

< lo^oi^a-,) - o(Aoiz° (i+1)fc _,)i 



1=3 
oo 

=^ = pLVfcJ = ^ll(pV*)* (45) 



1-p 1-p 

<-^(p lA ? (46) 

where p = pa^fc as defined in Lemma 7, and l|44|) follows from Lemma 8. By letting (3 — jzr, and 7 = p 1 ^, 
we have proved Part (c). 

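(d) We know that for the individual sequence pair $(x_0, z^0_{-t})$, 

\begin{align*}
Q(x_0 | z^0_{-t}) &= \frac{\sum_{x^{-1}_{-t}} Q(x^0_{-t}, z^0_{-t})}{Q(z^0_{-t})}
= \frac{\sum_{x^{-1}_{-t}} Q(x^0_{-t}, z^0_{-t})}{\sum_{x^0_{-t}} Q(x^0_{-t}, z^0_{-t})} \\
&= \frac{\sum_{x^{-1}_{-t}} \big( Q(x^0_{-t}) \prod_{i=-t}^{0} \Pi(x_i, z_i) \big)}{\sum_{x^0_{-t}} \big( Q(x^0_{-t}) \prod_{i=-t}^{0} \Pi(x_i, z_i) \big)}.
\end{align*}

For $Q \in \Theta_k^\delta$, $\Pi$ is fixed and we can think of $\prod_{i=-t}^{0} \Pi(x_i, z_i)$ as a constant for the individual sequence pair 
$(x^0_{-t}, z^0_{-t})$. Since 

\[ Q(x^0_{-t}) = Q(x^{k-1-t}_{-t}) \prod_{j=k-t}^{0} a_{x_{j-k}^{j-1} x_j}, \]

$Q(x_0 | z^0_{-t})$ is the ratio of two finite-order polynomials of $\{a_{ij}\}$, and as $\Theta_k^\delta$ is closed and bounded, $Q(x_0 | z^0_{-t})$ is 
a uniformly continuous function of $\{a_{ij}\}$. Therefore, for given $\eta$, $\exists\, \epsilon(\eta)$ such that $\|Q - Q'\|_1 \le \epsilon(\eta)$ implies 

\[ \max_{x_0, z^0_{-t}} \big| Q(x_0 | z^0_{-t}) - Q'(x_0 | z^0_{-t}) \big| \le \eta, \]

since there are only a finite number of possible $(x_0, z^0_{-t})$ pairs. Also, since $\Theta_k^\delta$ is compact, we can always find a 
finite set $\mathcal{F}_k(t, \eta)$ such that for any $Q \in \Theta_k^\delta$, there exists at least one $Q' \in \mathcal{F}_k(t, \eta)$ that satisfies $\|Q - Q'\|_1 \le \epsilon(\eta)$. 
Therefore, Part (d) is proved. 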
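Proof of Lemma 3: To prove Lemma 3, first consider the following limit. 

\begin{align*}
\lim_{n \to \infty} E\big(L_{\hat{X}_Q^\epsilon}(X^n, Z^n)\big)
&= \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} E\big(\ell(X_i, \hat{X}_Q^\epsilon(Z^i))\big) \\
&= \lim_{i \to \infty} E\big(\ell(X_i, \hat{X}_Q^\epsilon(Z^i))\big) && (47) \\
&= \lim_{t \to \infty} E\big(\ell(X_0, \hat{X}_Q^\epsilon(Z^0_{-t}))\big) && (48) \\
&= E\big(\ell(X_0, \hat{X}_Q^\epsilon(Z^0_{-\infty}))\big) \quad \text{uniformly on } \Theta_k^\delta, && (49)
\end{align*}

where (47) is from Cesaro's mean convergence theorem, (48) is from stationarity, and (49) is from Lemma 9(b) and the 
bounded convergence theorem. Thus, to complete the proof, we need to show that 

\[ \lim_{n \to \infty} L_{\hat{X}_Q^\epsilon}(X^n, Z^n) = E\big(\ell(X_0, \hat{X}_Q^\epsilon(Z^0_{-\infty}))\big) \quad \text{a.s. uniformly on } \Theta_k^\delta. \qquad (50) \]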



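Now, let's show the pointwise convergence in (50), without the uniformity, by using the ergodic theorem. For given $Q$, 
define 

\[ g_Q(X, Z) \triangleq \ell\big(X_0, \hat{X}_Q^\epsilon(Z^0_{-\infty})\big) \]

and denote by $T$ the shift operator. Then, what we should prove becomes 

\[ \lim_{n \to \infty} \frac{1}{n} \sum_{t=1}^{n} g_{t,Q}(T^t(X, Z)) = E\big(g_Q(X, Z)\big) \quad \text{a.s.}, \]

while the ergodic theorem gives 

\[ \lim_{n \to \infty} \frac{1}{n} \sum_{t=1}^{n} g_Q(T^t(X, Z)) = E\big(g_Q(X, Z)\big) \quad \text{a.s.} \]

Observe that 

\begin{align*}
\Big| \frac{1}{n} \sum_{t=1}^{n} g_{t,Q}(T^t(X, Z)) - \frac{1}{n} \sum_{t=1}^{n} g_Q(T^t(X, Z)) \Big|
&\le \frac{1}{n} \sum_{t=1}^{n} \big| g_{t,Q}(T^t(X, Z)) - g_Q(T^t(X, Z)) \big| \\
&= \frac{1}{n} \sum_{t=1}^{n} \big| \ell\big(X_t, \hat{X}_Q^\epsilon(Z_1^t)\big) - \ell\big(X_t, \hat{X}_Q^\epsilon(Z^t_{-\infty})\big) \big|.
\end{align*}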



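Since Lemma 9(c) holds for $\forall \omega$, we can think that the lemma holds for all individual sequence pairs $(x_0, z^0_{-\infty})$. 
Thus, it holds for all individual pairs $(x_t, z^t_{-\infty})$, too, and we can conclude that $Q(X_t | Z_1^t) \to Q(X_t | Z^t_{-\infty})$ for $\forall \omega$ as 
$t \to \infty$. Hence, by exactly the same argument as in Lemma 9(a) and Lemma 9(b), we conclude that $\ell\big(X_t, \hat{X}_Q^\epsilon(Z_1^t)\big) \to 
\ell\big(X_t, \hat{X}_Q^\epsilon(Z^t_{-\infty})\big)$ almost surely as $t \to \infty$. Now, by Cesaro's mean convergence theorem, we obtain 

\[ \frac{1}{n} \sum_{t=1}^{n} \big| \ell\big(X_t, \hat{X}_Q^\epsilon(Z_1^t)\big) - \ell\big(X_t, \hat{X}_Q^\epsilon(Z^t_{-\infty})\big) \big| \to 0 \quad \text{a.s.} \]

Therefore, we get 

\[ L_{\hat{X}_Q^\epsilon}(X^n, Z^n) \to E\big(\ell(X_0, \hat{X}_Q^\epsilon(Z^0_{-\infty}))\big) \quad \text{a.s.} \]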



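Note that up to this point we cannot guarantee the uniformity of the convergence, since the ergodic theorem 
only gives the individual convergence for each $Q$. To show the uniformity of the convergence in (50), first define the 
following quantity for some fixed integer $t \in [1, n-1]$: 

\[ L_{\hat{X}^{\epsilon}_{Q,t}}(X^n, Z^n) \triangleq \frac{1}{n} \Big( \sum_{i=1}^{t} \ell\big(X_i, \hat{X}_Q^\epsilon(Z^i)\big) + \sum_{i=t+1}^{n} \ell\big(X_i, \hat{X}_Q^\epsilon(Z^i_{i-t})\big) \Big). \]

From Lemma 9(d), for any $Q \in \Theta_k^\delta$ and fixed $t, \eta > 0$, we can pick some $Q' \in \mathcal{F}_k(t, \eta)$ such that $\|Q - Q'\|_1 \le \epsilon(\eta)$, 
and thus, 

\[ \max_{x_0, z^0_{-t}} \big| Q(x_0 | z^0_{-t}) - Q'(x_0 | z^0_{-t}) \big| \le \eta. \]

By adding and subtracting some common terms involving such $Q'$, and from the triangle inequality, we have 

\begin{align*}
& \big| L_{\hat{X}^{\epsilon}_{Q}}(X^n, Z^n) - E\big(\ell(X_0, \hat{X}_Q^\epsilon(Z^0_{-\infty}))\big) \big| \\
&\le \big| L_{\hat{X}^{\epsilon}_{Q}}(X^n, Z^n) - L_{\hat{X}^{\epsilon}_{Q,t}}(X^n, Z^n) \big| + \big| L_{\hat{X}^{\epsilon}_{Q,t}}(X^n, Z^n) - L_{\hat{X}^{\epsilon}_{Q',t}}(X^n, Z^n) \big| + \big| L_{\hat{X}^{\epsilon}_{Q',t}}(X^n, Z^n) - L_{\hat{X}^{\epsilon}_{Q'}}(X^n, Z^n) \big| \\
&\quad + \big| L_{\hat{X}^{\epsilon}_{Q'}}(X^n, Z^n) - E\big(\ell(X_0, \hat{X}_{Q'}^\epsilon(Z^0_{-\infty}))\big) \big| + \big| E\big(\ell(X_0, \hat{X}_{Q'}^\epsilon(Z^0_{-\infty}))\big) - E\big(\ell(X_0, \hat{X}_Q^\epsilon(Z^0_{-\infty}))\big) \big|. && (51)
\end{align*}

Now, the goal becomes to show that the terms on the right-hand side of the inequality converge to zero independently 
of $Q$ as $n$, $t$, and $\eta$ vary. First, we will bound each term and send $n \to \infty$. 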



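(1) 

\begin{align*}
\big| L_{\hat{X}^{\epsilon}_{Q}}(X^n, Z^n) - L_{\hat{X}^{\epsilon}_{Q,t}}(X^n, Z^n) \big|
&\le \frac{1}{n} \sum_{i=t+1}^{n} \big| \ell\big(X_i, \hat{X}_Q^\epsilon(Z^i)\big) - \ell\big(X_i, \hat{X}_Q^\epsilon(Z^i_{i-t})\big) \big| \\
&\le \Lambda_{\max} \frac{1}{n} \sum_{i=t+1}^{n} \big\| \hat{X}_Q^\epsilon(Z^i) - \hat{X}_Q^\epsilon(Z^i_{i-t}) \big\|_1 \\
&\le \Lambda_{\max} M^2 \cdot \frac{1}{n} \sum_{i=t+1}^{n} \big\| Q_{X_0 | Z^0_{-i}} - Q_{X_0 | Z^0_{-t}} \big\|_1 && (52) \\
&\le \Lambda_{\max} M^3 \cdot \frac{1}{n} \sum_{i=t+1}^{n} (\beta \gamma^i + \beta \gamma^t) && (53) \\
&\to \Lambda_{\max} M^3 \beta \gamma^t \quad \text{a.s. uniformly on } \Theta_k^\delta, && (54)
\end{align*}

where (52) is from stationarity and Lemma 9(a), (53) is from Lemma 9(c), and (54) is from Cesaro's mean 
convergence theorem. Since (53) does not depend on $Q$, the limit is uniform on $\Theta_k^\delta$. 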
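(2) 

\begin{align*}
\big| L_{\hat{X}^{\epsilon}_{Q,t}}(X^n, Z^n) - L_{\hat{X}^{\epsilon}_{Q',t}}(X^n, Z^n) \big|
&\le \frac{1}{n} \sum_{i=t+1}^{n} \big| \ell\big(X_i, \hat{X}_Q^\epsilon(Z^i_{i-t})\big) - \ell\big(X_i, \hat{X}_{Q'}^\epsilon(Z^i_{i-t})\big) \big| + \frac{t}{n} \cdot \Lambda_{\max} \\
&\le \Lambda_{\max} \cdot \frac{1}{n} \sum_{i=t+1}^{n} \big\| \hat{X}_Q^\epsilon(Z^i_{i-t}) - \hat{X}_{Q'}^\epsilon(Z^i_{i-t}) \big\|_1 + \frac{t}{n} \cdot \Lambda_{\max} \\
&\le \Lambda_{\max} M^2 \cdot \frac{1}{n} \sum_{i=t+1}^{n} \big\| Q_{X_i | Z^i_{i-t}} - Q'_{X_i | Z^i_{i-t}} \big\|_1 + \frac{t}{n} \cdot \Lambda_{\max} && (55) \\
&\le \Lambda_{\max} M^3 \eta + \frac{t}{n} \cdot \Lambda_{\max} && (56) \\
&\to \Lambda_{\max} M^3 \eta \quad \text{a.s. uniformly on } \Theta_k^\delta,
\end{align*}

where (55) is from Lemma 9(a), and (56) is from Lemma 9(d). Since (56) does not depend on $Q$, the limit is 
also uniform on $\Theta_k^\delta$. 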



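(3) 

\[ \big| L_{\hat{X}^{\epsilon}_{Q',t}}(X^n, Z^n) - L_{\hat{X}^{\epsilon}_{Q'}}(X^n, Z^n) \big| \to \; \le \Lambda_{\max} M^3 \beta \gamma^t \quad \text{a.s.} \]

by following the same argument as in (1). Since $\mathcal{F}_k(t, \eta)$ is finite, this convergence is uniform on $\mathcal{F}_k(t, \eta)$. 

(4) 

\[ \big| L_{\hat{X}^{\epsilon}_{Q'}}(X^n, Z^n) - E\big(\ell(X_0, \hat{X}_{Q'}^\epsilon(Z^0_{-\infty}))\big) \big| \to 0 \quad \text{a.s.} \]

from the proof of pointwise convergence above. As in (3), this convergence is also uniform on $\mathcal{F}_k(t, \eta)$. 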



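(5) 

\begin{align*}
& \big| E\big(\ell(X_0, \hat{X}_{Q'}^\epsilon(Z^0_{-\infty}))\big) - E\big(\ell(X_0, \hat{X}_Q^\epsilon(Z^0_{-\infty}))\big) \big| \\
&\le \big| E\big(\ell(X_0, \hat{X}_{Q'}^\epsilon(Z^0_{-\infty}))\big) - E\big(\ell(X_0, \hat{X}_{Q'}^\epsilon(Z^0_{-t}))\big) \big| + \big| E\big(\ell(X_0, \hat{X}_{Q'}^\epsilon(Z^0_{-t}))\big) - E\big(\ell(X_0, \hat{X}_Q^\epsilon(Z^0_{-t}))\big) \big| \\
&\quad + \big| E\big(\ell(X_0, \hat{X}_Q^\epsilon(Z^0_{-t}))\big) - E\big(\ell(X_0, \hat{X}_Q^\epsilon(Z^0_{-\infty}))\big) \big| \\
&\le \Lambda_{\max} M^3 (2 \beta \gamma^t + \eta)
\end{align*}

by a similar argument as in (1) and (2). 
Therefore, by taking the limit supremum on both sides of (51), we get 

\[ \limsup_{n \to \infty} \big| L_{\hat{X}^{\epsilon}_{Q}}(X^n, Z^n) - E\big(\ell(X_0, \hat{X}_Q^\epsilon(Z^0_{-\infty}))\big) \big| \le \Lambda_{\max} M^3 (4 \beta \gamma^t + 2 \eta) \quad \text{a.s. uniformly on } \Theta_k^\delta. \]

Since $t$ and $\eta$ are arbitrary, by sending $t \to \infty$ and $\eta \downarrow 0$, we have 

\[ \limsup_{n \to \infty} \big| L_{\hat{X}^{\epsilon}_{Q}}(X^n, Z^n) - E\big(\ell(X_0, \hat{X}_Q^\epsilon(Z^0_{-\infty}))\big) \big| \le 0 \quad \text{a.s. uniformly on } \Theta_k^\delta. \]

Therefore, the lemma is proved. ■ 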

Appendix 3 

Here, we prove Corollary 1. 

Proof of Corollary 1: First note the subtle point that Corollary 1 does not directly follow from Lemma 3. The 
probability law $\hat{Q}_k^t$ that we use to filter each block changes from block to block, whereas the uniform convergence 
in Lemma 3 is for a fixed $Q \in \Theta_k^{\delta_k}$ for all $t$, so Lemma 3 alone is not enough to guarantee the Corollary. However, since $\hat{Q}_k^t$ remains 
the same within each block, we can still use the result of Lemma 3 if the block length gets long enough. Keeping 
this in mind, let's take a more careful look at each block. In the proof, for brevity of notation, let's denote 

\[ \ell_t(Q) \triangleq \ell\big(X_t, \hat{X}_Q^\epsilon(Z^t)\big), \]

since we are always dealing with the randomized filter, and there is no possibility of confusion. Now, fix any $\delta > 0$. 
Then, from (??), 



in tui-i S 
dl, such that < — — 



mi 



and from Lemma |3 



3N, such that max 



L ±t (X n , Z n ) - EL ±l (X n , Z n ) < 5/4 



Recalling the definition i(t) = max{i : to, < i}, we let I a = max(/,i(iV) + 1). Then, for any n > to/ , and 



\Ly- c {X n ,Z n )~ EL ±e (X n ,Z n )\ 



< 



(tt(Qi) - E(£ t (QD) 

n 

E 



t—mi( n \-\-l 



£t(Q[Z m ^>}) - E{l t {Q[Z m ^])i 



{e t (Q[Z m ^}) - E{£ t {Q[Z m ^])) 



(57) 
(58) 

(59) 



Note that in the second and third term, Q{ is fixed to QIZ™^- 1 ) and Q[Z m ^] fr om the definition of our filter. 
Now, we can bound each term. For the first term, since n > m;.(„\ > mr, we know that m ' < " ) ~ 1 < T "' ( " ) ~ 1 < -^A — . 
Therefore, 

1 



(£ t (Ql)-E(l t (Ql)) 



— Qf> c max — o ' 
For the second term, since n > > N, 



<- 



1 

n 

m 



J2 (tt{Q[Z m ^\) - E{l t {Q[Z m ^])) 

t=TOj(„)_i + l 

m i(») 

J2 (lt(Q[Z m ^-i]) - E{i t {Q[Z m ^])) 



(60) 



i(n) 



n m 



i(n) 



t=i 



m«(„)-i 



t=l 



-A HP ax ~ 8 

Finally, for the last term, 



1 

<- 

n 



J2 (it(Q[Z m ^}) - E(£ t (Q[Z m ^}))) 



n 1 ™«(«) 

^(^(Q[z m -(")])-^ t (Q[^<"»]))) +- E (^(Q[^ mi( " ) ])-^(^(Q^ mi< " ) ]))) 

t=l n t=l 



<5 5 S 
<- + - = -. 
~4 4 2 

Therefore, for any n > mi , and mj(„) < n < mj(„) + i, we have 

L^. fc (X n ,Z") k (X n ,Z n ) 
and since <5 was arbitrary, we have the corollary. ■ 



(61) 
(62) 

(63) 

(64) 
(65) 



<S, 